Preface


We are proud to present the proceedings of the 18th International Symposium on
Intelligent Data Analysis (IDA 2020), which was held during April 27-29, 2020, in
Konstanz, Germany. The first symposium of this series was organized in 1995, and the
symposium was held biennially until 2009, when it switched to being held annually.
Following demand expressed by the IDA community in a survey held in 2018, IDA 2020
was the first of the series to take place in spring rather than fall, as was common before.

The switch to April, and a more organized outreach to the community, coincided
with an increase in the number of submissions from 65 in 2018 to 114 in 2020. After a
rigorous review process, 45 of these 114 submissions were accepted for presentation.
Almost all submissions were reviewed by at least three Program Committee
(PC) members (only two papers had two reviews) and a substantial number of
submissions received more than three reviews. In addition to the PC, the review process
also involved program chair advisors, a select set of senior researchers with a
multi-year involvement in the IDA symposium series. Whenever a program chair
advisor flagged a paper with an informed, thoughtful, positive review due to the paper
presenting a particularly interesting and novel idea, the paper was accepted irrespective
of the other reviews. Each accepted paper was offered a slot for either oral presentation
(15 papers) or poster presentation (30 papers).

We wish to express our gratitude to the authors of all submitted papers for their
high-quality contributions; to the PC members and additional reviewers for their efforts
in reviewing, discussing, and commenting on all submitted papers; to the program chair
advisors for their active involvement; and to the IDA council for their ongoing
guidance and support. Many people have helped behind the scenes to make IDA 2020
possible, but this year we are particularly grateful to our publicity chairs who helped
spread the word: Daniela Gawehns and Hugo Manuel Proença!
Multivariate Time Series as Images:
Imputation Using Convolutional 
Denoising Autoencoder 


Abdullah Al Safi, Christian Beyer, Vishnu Unnikrishnan,
and Myra Spiliopoulou 


Fakultät für Informatik, Otto-von-Guericke-Universität, 
Postfach 4120, 39106 Magdeburg, Germany 
abdullah.safi@st.ovgu.de, 
{christian.beyer,vishnu.unnikrishnan,myra}@ovgu.de


Abstract. Missing data is a common occurrence in the time series 
domain, for instance due to faulty sensors, server downtime or patients 
not attending their scheduled appointments. One of the best methods to 
impute these missing values is Multiple Imputations by Chained Equa- 
tions (MICE) which has the drawback that it can only model linear rela- 
tionships among the variables in a multivariate time series. The advance- 
ment of deep learning and its ability to model non-linear relationships 
among variables make it a promising candidate for time series imputa- 
tion. This work proposes a modified Convolutional Denoising Autoen- 
coder (CDA) based approach to impute multivariate time series data 
in combination with a preprocessing step that encodes time series data 
into 2D images using Gramian Angular Summation Field (GASF). We 
compare our approach against a standard feed-forward Multi Layer Per- 
ceptron (MLP) and MICE. All our experiments were performed on 5 
UEA MTSC multivariate time series datasets, where 20 to 50% of the 
data was simulated to be missing completely at random. The CDA model 
outperforms all the other models in 4 out of 5 datasets and is tied for 
the best algorithm in the remaining case. 


Keywords: Convolutional Denoising Autoencoder · Gramian Angular
Summation Field · MICE · MLP · Imputation · Time series


1 Introduction 


Time series data arises in many domains of industry and research
and is often corrupted with missing data. For further use or analysis, the data
often needs to be complete, which gives rise to the need for imputation
techniques that introduce the least possible error into
the data. One of the most prominent imputation methods is MICE, which uses
iterative regression and value replacement to achieve state-of-the-art imputation
quality but has the drawback that it can only model linear relationships among
variables (dimensions).

© The Author(s) 2020 
In the past few years, different deep learning architectures have broken into
different problem domains, often exceeding the performance previously achieved by
other algorithms [7]. Areas like speech recognition, natural language processing
and computer vision were greatly impacted and improved by deep learning
architectures. Deep learning models have a robust capability of modelling latent
representations of the data and non-linear patterns, given enough training data.
Hence, this work presents a deep learning based imputation model called
Convolutional Denoising Autoencoder (CDA) with altered convolution and pooling
operations in the Encoder and Decoder segments. Instead of using the traditional
steps of convolution and pooling in the Encoder, we use deconvolution and
upsampling, which was inspired by [5]. The time series to image transformation
mechanisms proposed in [12] and [13] were adopted as a preprocessing step, as CDA
models are typically designed for images. As rival imputation models, Multiple
Imputation by Chained Equations (MICE) and a Multi Layer Perceptron (MLP) based
imputation were used.


2 Related Work 


Three distinct types of missingness in data were identified in [8]. The first one 
is Missing Completely At Random (MCAR), where the missingness of the data 
does not depend on itself or any other variables. In Missing At Random (MAR) 
the missing value depends on other variables but not on the variable where the 
data is actually missing and in Missing Not At Random (MNAR) the missingness 
of an observation depends on the concerned variable itself. All the experiments 
in this study were carried out on MCAR missingness as reproducing MAR and 
MNAR missingness can be challenging and hard to distinguish [5]. 

Multiple Imputation by Chained Equations (MICE) has secured its place as
a principal method for imputing missing data [1]. Costa et al. [3] showed
experimentally that MICE offered better imputation quality than a Denoising
Autoencoder based model for several missing percentages and missing types.

A novel approach was proposed in [14], incorporating Generative Adversarial
Networks (GAN) to perform imputations; the authors named it Generative
Adversarial Imputation Nets (GAIN). The approach compared favorably against
some state-of-the-art imputation methods including MICE. An Autoencoder
based approach was proposed in [4], which was compared against an Artificial
Neural Network (NN) model on MCAR missing type and several missing
percentages. The proposed model performed well against NN. A novel Denoising 
Autoencoder based imputation using partial loss (DAPL) approach was pre- 
sented in [9], where different missing data percentages and MCAR missing type 
were simulated in a breast cancer dataset. The comparisons incorporated sta- 
tistical, machine learning based approaches and standard Denoising Autoen- 
coder (DAE) model where DAPL outperformed DAE and all the other models. 
An MLP based imputation approach was presented for MCAR missingness in 
[10] and also outperformed other statistical models. A Convolutional Denoising
Autoencoder model which did not impute missing data but denoised audio
signals was presented in [15]. A Denoising Autoencoder with more units in the
encoder layer than in the input layer was presented in [5] and achieved good
imputation results against MICE. Our work was inspired by both of these works,
which is why we combined the two approaches into a Convolutional Denoising
Autoencoder which maps input data into a higher-dimensional space in the Encoder.


3 Methodology 


In this section we first describe how we introduce missing data in our datasets, 
then we show the process used to turn multivariate time series into images 
which is required by one of our imputation methods and finally we introduce the 
imputation methods which were compared in this study. 


3.1 Simulating Missing Data 


Simulating missing data means artificially introducing unobserved
values into a complete time series dataset. Our experiments used 20%,
30%, 40% and 50% of missing data and the missing type was MCAR. Introducing
MCAR missingness is simple, as it does not depend on observed
or unobserved data. Many studies assume the MCAR missing type when
there is no concrete evidence of the missingness type [6]. In this experimental
framework, values at randomly selected indices were erased from randomly selected
variables, which simulated MCAR missingness of different percentages.
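
The erasure step can be illustrated with a minimal sketch. It assumes that a series is stored as a list of rows (one row per time point), that missing values are marked with None, and that the missing rate is taken relative to the cells of the selected variables; the function name simulate_mcar and these conventions are illustrative assumptions, not details given in the paper.

```python
import random


def simulate_mcar(series, missing_rate, variables, seed=0):
    """Return a copy of `series` with values erased completely at random.

    series       -- list of rows, one row per time point
    missing_rate -- fraction of the candidate cells to erase (assumed convention)
    variables    -- column indices that are allowed to lose values
    """
    rng = random.Random(seed)
    corrupted = [list(row) for row in series]
    # Candidate cells: every time point of every selected variable.
    cells = [(t, v) for t in range(len(series)) for v in variables]
    n_missing = round(missing_rate * len(cells))
    # Sampling without replacement ignores the values themselves, so the
    # missingness depends neither on observed nor on unobserved data (MCAR).
    for t, v in rng.sample(cells, n_missing):
        corrupted[t][v] = None
    return corrupted
```

Because the erased positions are drawn independently of the data, the same function covers all four missing rates by changing only missing_rate.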


3.2 Translating Time Series into Images 


A novel approach of encoding time series data into various types of images using
the Gramian Angular Field (GAF) was presented in [12] to improve classification
and imputation. One of the variants of GAF is the Gramian Angular Summation
Field (GASF), which comprises multiple encoding steps. First,
the time series is scaled to the range [-1, 1].

$$x'_i = \frac{(x_i - \max(X)) + (x_i - \min(X))}{\max(X) - \min(X)} \tag{1}$$


Here, x_i is the value at time point i, x'_i is its scaled value,
X is the time series and X' is the scaled time series. The time series is scaled
to the range [-1, 1] so that it can be represented in polar coordinates by
applying the arccosine.


$$\theta_i = \arccos(x'_i), \quad \{-1 \le x'_i \le 1,\ x'_i \in X'\} \tag{2}$$


The polar encoded time series vector is then transformed into a matrix. If 
the length of the time series vector is n, then the transformed matrix is of shape 
(n x n). 

$$\mathrm{GASF}_{i,j} = \cos(\theta_i + \theta_j) \tag{3}$$
The GASF represents the temporal features in the form of an image where
time advances from top-left to bottom-right, thereby preserving the time
factor in the data. Figure 1 shows the different steps of the time series to image
transformation.


[Figure 1 panels: Time Series, Polar Coordinates, Gramian Angular Summation Field (GASF)]


Fig. 1. Time series to image transformation 
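
The three steps of Eqs. (1) to (3) can be sketched for a single univariate series as follows. The function name gasf, the list-of-lists image format and the rejection of constant series (where the denominator of Eq. (1) is zero) are assumptions made for illustration.

```python
import math


def gasf(series):
    """Encode a univariate series of length n as an n x n GASF image."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled with Eq. (1)")
    # Eq. (1): min-max rescaling into [-1, 1].
    scaled = [((x - hi) + (x - lo)) / (hi - lo) for x in series]
    # Clamp to protect arccos against floating point round-off.
    scaled = [max(-1.0, min(1.0, s)) for s in scaled]
    # Eq. (2): angular (polar) encoding.
    theta = [math.acos(s) for s in scaled]
    # Eq. (3): pairwise cosine of summed angles.
    return [[math.cos(ti + tj) for tj in theta] for ti in theta]
```

For the series [0, 1, 2] the scaled values are -1, 0 and 1, so the main diagonal of the image is cos(2π) = 1, cos(π) = -1 and cos(0) = 1, and the matrix is symmetric because the sum of angles is commutative.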


The methods of encoding time series into images described in [12] were only 
applicable for univariate time series. The GASF transformation generates one 
image for one time series dimension and thus it is possible to generate multiple 
images for multivariate time series. An approach which vertically stacked images 
transformed from different variables was presented in [13], see Fig. 2. The images 
were grayscaled and the different orders of vertical stacking (ascending, descend- 
ing and random) were examined by performing a statistical test. The stacking 
order did not impact classification accuracy. 


[Figure 2 diagram: images of variable 1, variable 2 and variable 3 are vertically stacked and passed to a classification or imputation model]


Fig. 2. Vertical stacking of images transformed from different variables 
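
Vertical stacking amounts to concatenating the per-variable images row-wise. A minimal sketch, assuming each image is a list of rows and that all images share the same width (the function name stack_vertically is hypothetical):

```python
def stack_vertically(images):
    """Concatenate per-variable images top to bottom, in the given order."""
    width = len(images[0][0])
    stacked = []
    for image in images:
        if any(len(row) != width for row in image):
            raise ValueError("all images must have the same width")
        # Copy the rows so the stacked image does not alias its inputs.
        stacked.extend(list(row) for row in image)
    return stacked
```

With m variables and series length n, each GASF image is n x n, so the stacked image has m * n rows and n columns.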
3.3 Convolutional Denoising Autoencoder


The Autoencoder is a very popular unsupervised deep learning model frequently
found in different application areas. It reconstructs the original input by
discovering robust features in the hidden layer
representation. The latent representation of high dimensional data in the hidden
layer contributes to reconstructing the original data. The architecture of an
Autoencoder consists of two principal segments named Encoder and Decoder.
The Encoder usually compresses the original representation of the data into
a lower dimension. The Decoder decodes the low dimensional representation of
the input back into its original dimensional representation.


$$\mathrm{Encoder}(x^n) = s(x^n W_E + b_E) = x^d \tag{4}$$


$$\mathrm{Decoder}(x^d) = s(x^d W_D + b_D) = x^n \tag{5}$$


Here, x^n is the original input with n dimensions and x^d is its encoded
representation. s is any non-linear activation function, W is weight and b is bias.

The Denoising Autoencoder is an extension of the Autoencoder where the input
is reconstructed from a corrupted version of it. There are different ways of adding
corruption, such as Gaussian noise or setting some values to zero. The noisy
input is fed to the model, which minimizes the loss between the clean input
and the input reconstructed from its corrupted version. The objective function looks as follows


$$\mathrm{RMSE}(X, X') = \sqrt{\frac{1}{N} \lVert X_{clean} - X'_{reconstructed} \rVert^2} \tag{6}$$

where N is the number of values in X.


The Convolutional Denoising Autoencoder (CDA) incorporates the convolution
operation, which is typically performed in Convolutional Neural Networks (CNN). In a
CNN, the layers of perceptrons are replaced by convolution
layers and the convolution operation is performed on the data. Convolution is defined
as the sum (or integral) of the product of two functions over a finite or infinite
range, where the two functions are the input data (e.g. an image) and a fixed size
kernel, respectively. The kernel traverses the input space to generate feature maps.
The feature maps capture important features of the data. The feature maps are then
pooled, preserving the important features.

The combination of convolutional feature map generation and pooling is
performed in the Encoder layer of a CDA, where the corrupted version of the input is
fed into the input layer of the network. The Decoder layer performs Deconvolution
and Upsampling, which decompresses the output coming from the Encoder layer
back into the shape of the input data. The loss between the reconstructed data and
the clean data is minimized. In this work, the default architecture of the CDA is
tweaked in favor of imputing multivariate time series data. Deconvolution and
Upsampling are performed in the Encoder layer and Convolution and Maxpooling
are performed in the Decoder layer. The motivation behind this specific tweak
came from [5], where a Denoising Autoencoder was designed with more hidden
units in the Encoder layer than in the input layer. The high dimensional representation
in the Encoder layer created additional features, which contributed to data
recovery.


3.4 Competitor Models 


Multiple Imputation by Chained Equations (MICE): MICE, which is sometimes
referred to as fully conditional specification or sequential regression multiple
imputation, has emerged in the statistical literature as the principal method
of addressing missing data [1]. MICE creates multiple versions of the imputed
dataset through the multiple imputation technique.


The steps for performing MICE are the following: 


- A simple imputation method is performed across the time series (mean, mode
  or median). The values imputed at the missing time points are referred to as
  “placeholders”.

- If there are m variables in total with missing points, then the placeholders
  of one of these variables are set back to the missing state. The variable with
  the “missing state” label is considered the dependent variable and the other
  variables are considered predictors.

- A regression is performed under these settings and the “missing state” variable is
  imputed. Different regressions are supported in this architecture but since the
  datasets only contain continuous values, linear, ridge or lasso regression are
  chosen.

- The remaining m - 1 “missing state” variables are regressed and imputed in the same
  way. Once all the m variables are imputed, one iteration is completed. More
  iterations are performed and the imputations are updated in the time series in
  each iteration.

- The number of iterations can be determined by observing whether the coefficients
  of the regression model have converged.


According to the experimental setup of our work, MICE had three different 
regression supports, namely Linear, Ridge and Lasso regression. 
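
The chained-equation loop described in the steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: it runs a single deterministic chain with ordinary least squares (a tiny ridge term is added only for numerical stability), marks missing values with None, and returns one completed dataset, whereas full MICE draws imputations from a predictive distribution and produces several datasets. All function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_linear(X, y, ridge=1e-8):
    """Least squares with intercept via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(p)]
    return solve(A, b)


def predictors(row, j):
    """All values of a row except the dependent variable j."""
    return [row[k] for k in range(len(row)) if k != j]


def mice(data, n_iter=10):
    """Single-chain sketch of MICE; None marks a missing value."""
    m = len(data[0])
    missing = [[v is None for v in row] for row in data]
    filled = [list(row) for row in data]
    # Step 1: mean imputation creates the placeholders.
    for j in range(m):
        observed = [row[j] for row in data if row[j] is not None]
        mean = sum(observed) / len(observed)
        for row in filled:
            if row[j] is None:
                row[j] = mean
    for _ in range(n_iter):
        for j in range(m):
            target_rows = [i for i in range(len(data)) if missing[i][j]]
            if not target_rows:
                continue
            # Steps 2-3: regress variable j on the others using rows where j is observed.
            train = [i for i in range(len(data)) if not missing[i][j]]
            w = fit_linear([predictors(filled[i], j) for i in train],
                           [filled[i][j] for i in train])
            # Replace the placeholders of variable j by the regression predictions.
            for i in target_rows:
                x = [1.0] + predictors(filled[i], j)
                filled[i][j] = sum(wk * xk for wk, xk in zip(w, x))
    return filled
```

Swapping fit_linear for a ridge or lasso fit would give the other two regression variants mentioned in the text; the chaining logic stays the same.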


Multi Layer Perceptron (MLP) Based Imputation: The imputation mechanism 
of MLP is inspired by the MICE algorithm. Nevertheless, MLP based impu- 
tation models do not perform the chained or multiple imputations like MICE 
but improve the quality of imputation over several epochs as stochastic gradient 
descent optimizes the weights and biases per epoch. A concrete MLP architec- 
ture was described in literature [10] which was a three layered MLP with the 
hyperbolic tangent activation function in the hidden layer and the identity func- 
tion (linear) as the activation function for the output layer. The train and test 
split were slightly different, where training set and test set consisted of both 
observed and unobserved data. 

The imputation process of MLP model in our work is similar to MICE but 
the non-linear activation function of MLP facilitates finding complex non-linear 
patterns. However, the imputation of a variable is performed only once, in con- 
trast to the multiple iterations in MICE. 
4 Experiments


In this section we present the datasets used, the preprocessing steps that
were conducted before training, the chosen hyperparameters and our evaluation
method. Our complete imputation process for the CDA model is depicted
in Fig. 3. The process for the competitors is the same except that corrupting the
training data and turning the time series into images are not performed.


[Figure 3 flowchart: simulating artificial missingness → corrupting → transforming all time series into images → training → imputation → evaluation]


Fig. 3. Experiment steps for the CDA model 


4.1 Datasets and Data Preprocessing 


Our experiments were conducted on 5 time series datasets from the UEA MTSC 
repository [2]. Each dataset in UEA time series archive has training and test 
splits and specific number of dimensions. Each training or test split represents a 
time series. The table below presents all the relevant structural details (Table 1). 


Table 1. A structural summary of the 5 UEA MTSC datasets


Dataset name | Number of series | Dimensions | Length | Classes
ArticularyWordRecognition | 275 | 9 | 144 | 25
Cricket | 108 | 6 | 1197 | 12
Handwriting | 150 | 3 | 152 | 26
StandWalkJump | 12 | 4 | 2500 | 3
UWaveGestureLibrary | 120 | 3 | 315 |

The Length column of the table denotes the length of each time series. In our
framework, each time series was transformed into images. The number of time
series in each of the datasets was not very high. As we had selected
a deep learning model for imputation, such a low number of samples could cause
overfitting. Experiments showed us that the model did not perform well with the
original number of time series. Therefore, the main idea was to increase the number
of time series by splitting them into multiple parts, thereby reducing their
lengths. This modification provided more patterns for learning,
which aided imputation. The final lengths chosen were those that yielded
the best results. The table below presents the modified number of time series
and lengths for each dataset (Table 2).
Table 2. Modified number of time series and lengths


Dataset name | Number of series | Dimension | Length
ArticularyWordRecognition | 6600 | 9 | 6
Cricket | 6804 | 6 | 19
Handwriting | 1200 | 3 | 19
StandWalkJump | 3000 | 4 | 10
UWaveGestureLibrary | 1800 | 3 | 21
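
The relation between Tables 1 and 2 can be checked with a short sketch. It assumes that each series is cut into non-overlapping segments of equal length; the paper does not state this explicitly, but the assumption reproduces every count in Table 2. The names split_series and segmented_count are hypothetical.

```python
def split_series(series, segment_length):
    """Cut one series (list of time points) into non-overlapping segments."""
    n_segments = len(series) // segment_length
    return [series[k * segment_length:(k + 1) * segment_length]
            for k in range(n_segments)]


# (number of series, original length) from Table 1.
ORIGINAL = {
    "ArticularyWordRecognition": (275, 144),
    "Cricket": (108, 1197),
    "Handwriting": (150, 152),
    "StandWalkJump": (12, 2500),
    "UWaveGestureLibrary": (120, 315),
}

# Segment lengths from Table 2.
SEGMENT_LENGTH = {
    "ArticularyWordRecognition": 6,
    "Cricket": 19,
    "Handwriting": 19,
    "StandWalkJump": 10,
    "UWaveGestureLibrary": 21,
}


def segmented_count(name):
    """Number of segments a dataset yields under the splitting assumption."""
    n_series, length = ORIGINAL[name]
    return n_series * (length // SEGMENT_LENGTH[name])
```

For example, each ArticularyWordRecognition series of length 144 yields 144 / 6 = 24 segments, and 275 * 24 = 6600 matches the first row of Table 2.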


The evaluation of the imputation models requires a complete dataset and the
corresponding incomplete dataset. Therefore, artificial missingness was introduced
at different percentages (20%, 30%, 40% and 50%) into all the datasets.
After simulating artificial missingness, each dataset has an observed part, which
contains all the time series segments where no variables are missing, and an
unobserved part, where at least one variable is missing. The
observed data was further processed for training. As CDA models learn denoising
from a corrupted version of the input, we introduced noise by discarding
a certain amount of values for each observed case from specific variables and
replacing them by the mean of the corresponding variables. A higher amount
of noise has been seen to contribute more to learning the dependencies between
different variables, which leads to denoising of good quality [11]. The variables
selected for adding noise were the same variables that have missing data in the
unobserved part. Different amounts of noise were examined and 90% noise led to good
results. The unobserved data was also mean imputed, as the CDA model would apply the
denoising technique to the “mean-noise” for imputation. So the CDA learns
to deal with “mean-noise” on the observed part and is then applied to the mean
imputed unobserved part to create the final imputation.
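
The “mean-noise” corruption of an observed segment can be sketched as follows. The paper does not specify over which data the variable means are computed, so this sketch takes them as an argument; the function name corrupt_with_mean, the means dictionary and the rounding of the number of replaced values are assumptions for illustration.

```python
import random


def corrupt_with_mean(segment, means, noise_rate=0.9, seed=0):
    """Build a (corrupted, clean) training pair from one observed segment.

    segment    -- list of rows, one row per time point
    means      -- {variable index: mean value} for the variables to corrupt
    noise_rate -- fraction of values replaced per selected variable
    """
    rng = random.Random(seed)
    clean = [list(row) for row in segment]
    corrupted = [list(row) for row in segment]
    n_noisy = round(noise_rate * len(segment))
    for variable, mean in means.items():
        # Discard randomly chosen values and put the variable mean in their place.
        for t in rng.sample(range(len(segment)), n_noisy):
            corrupted[t][variable] = mean
    return corrupted, clean
```

The corrupted segment is the network input and the clean segment is the reconstruction target; at imputation time the mean-imputed unobserved segments play the role of the corrupted input.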

The next step was to perform the time series to image transformation, where all
the observed and unobserved chunks were rescaled to between -1 and 1 using
min-max scaling. The rescaled data was further transformed into polar coordinates and
then a GASF encoded image was obtained for each dimension. The multiple images
corresponding to the multiple variables were vertically stacked. Finally, both the
observed and unobserved splits consisted of their own set of images.

Note that the preprocessing described above was performed only for the CDA
based imputation model. The competitor models imputed using the raw format
of the data.


4.2 Model Architecture and Hyperparameters 


Our Model architecture was different from a general CDA, where the Encoder 
layer incorporates Deconvolution and Upsampling operations and the Decoder 
layer incorporates Convolution and Maxpooling operations. The Encoder and 
Decoder both have 3 layers. The table below demonstrates the structure of the 
imputation model (Table 3). 
Table 3. The architecture of CDA based imputation model


Segment | Operation | Layer name | Kernel size | Number of feature maps
Encoder | Upsampling | up_0 | (2, 2) | -
Encoder | Deconvolution | deconv_0 | (5, 5) | 64
Encoder | Upsampling | up_1 | (2, 2) | -
Encoder | Deconvolution | deconv_1 | (7, 7) | 64
Encoder | Upsampling | up_2 | (2, 2) | -
Encoder | Deconvolution | deconv_2 | (5, 6) | 128
Decoder | Convolution | conv_0 | (5, 6) | 128
Decoder | Maxpool | pool_0 | (2, 2) | -
Decoder | Convolution | conv_1 | (7, 7) | 64
Decoder | Maxpool | pool_1 | (2, 2) | -
Decoder | Convolution | conv_2 | (5, 5) | 64
Decoder | Maxpool | pool_2 | (2, 2) | -
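
To see how this inverted layout first inflates and then restores the input size, the spatial shape can be traced through the layers of Table 3. The sketch assumes that each (2, 2) upsampling or pooling layer changes height and width by a factor of 2, and that the convolution and deconvolution layers use stride 1 with size-preserving padding; neither padding nor stride is stated in the paper, so these are illustrative assumptions.

```python
# Layer order from Table 3; "conv" covers both convolution and deconvolution.
LAYERS = [
    ("up_0", "up"), ("deconv_0", "conv"),
    ("up_1", "up"), ("deconv_1", "conv"),
    ("up_2", "up"), ("deconv_2", "conv"),
    ("conv_0", "conv"), ("pool_0", "pool"),
    ("conv_1", "conv"), ("pool_1", "pool"),
    ("conv_2", "conv"), ("pool_2", "pool"),
]


def trace_shapes(height, width):
    """Return (layer name, height, width) after every layer."""
    shapes = [("input", height, width)]
    for name, kind in LAYERS:
        if kind == "up":
            height, width = height * 2, width * 2
        elif kind == "pool":
            height, width = height // 2, width // 2
        # "conv": size unchanged under the assumed stride 1 and same padding.
        shapes.append((name, height, width))
    return shapes
```

For an ArticularyWordRecognition segment (9 variables, length 6) the stacked GASF image is 54 x 6. Under these assumptions the Encoder expands it to 432 x 48 after deconv_2, and the Decoder brings it back to 54 x 6 after pool_2.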


Hyperparameters were selected by performing a random search over
different random combinations of hyperparameter values and the root mean
square error (RMSE) was used to decide on the best combination. Unlike grid
search, random search allowed us to avoid an exhaustive search. Based on the
random search, we selected stochastic gradient descent (SGD) as the optimizer,
which uses the backpropagated error to optimize the weights and biases. The number
of epochs was 100 and the batch size was 16.


4.3 Competitor Models’ Architecture and Hyperparameters


As competitor models, MICE and MLP based imputation models were selected.
The MLP based model had 3 hidden layers and the number of hidden units was 2/3 of
the number of input units in each layer. The hyperparameters for both
models were tuned using random search.

Hyperbolic Tangent Function was selected as activation function with a 
dropout of 0.3. Stochastic Gradient Descent operated as optimizer for 150 epochs 
and with a batch size of 20. 

MICE based imputation was demonstrated using Linear, Ridge and Lasso 
regression and 10 iterations were performed for each of them. 


4.4 Training 


Training was started based on the preprocessed data and the model architecture
described above. L2 regularization was used with a weight of 0.01 and stochastic
gradient descent was used as the optimizer, which outperformed the Adam and
Adagrad optimizers. The whole training process was about learning to minimize
the loss between the clean data and the reconstruction of the corrupted data so
that the model can be applied to
the unobserved data (noisy data after mean imputation) to perform imputation.
The training and validation split was 70% and 30%. Experiments show that the
training and validation losses saturated after approximately 10-15 epochs
in most cases.

The training was conducted on a machine with an Nvidia RTX 2060 and 16 GB
of RAM. The programming language for the training and all the steps
above was Python 3.7 and the operating system was Ubuntu 16.04 LTS.


4.5 Evaluation Criteria 


As all the time series datasets contain continuous numeric values, Root Mean
Square Error (RMSE) was selected for evaluation. In our experimental setup,
RMSE is not calculated over the whole time series; only the missing data points are
taken into account and compared with the ground truth:


$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i \in I} (x_i - y_i)^2}$$


where m is the total number of missing time points and I represents all the
indices of missing values across the time series.
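
A minimal sketch of this restricted RMSE, assuming the ground truth and the imputed series are stored as lists of rows and the missing positions are given as (time point, variable) index pairs; the function name and data layout are illustrative assumptions.

```python
import math


def rmse_on_missing(truth, imputed, missing_indices):
    """RMSE computed only over the positions that were missing."""
    m = len(missing_indices)
    if m == 0:
        raise ValueError("no missing positions to evaluate")
    squared_error = sum((truth[t][v] - imputed[t][v]) ** 2
                        for t, v in missing_indices)
    return math.sqrt(squared_error / m)
```

Observed positions never enter the sum, so a model is neither rewarded nor penalized for values it did not have to impute.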


5 Results 


Our proposed CDA based imputation model was compared with MLP and three
different versions of MICE, each using a different type of regression. Figure 4
presents the RMSE values for 20%, 30%, 40% and 50% missingness.
Fig. 4. RMSE plots for different missing proportions
The RMSE values for the CDA based model are the lowest at every percentage
of missingness on the Handwriting, ArticularyWordRecognition,
UWaveGestureLibrary and Cricket datasets. The depiction of the results on the Cricket
dataset is omitted due to space limitations. Unexpectedly, on the StandWalkJump
dataset the performance of the MLP and CDA models is very similar, and the MLP is
even better at 30% missingness. MICE (Linear) and MICE (Ridge) are identical
in imputation quality for all the datasets. MICE (Lasso) performed worst of all the
models, which implies that changing the regression type can
have an impact on the imputation quality. The MLP model beat all the MICE models
but was outperformed by the CDA model in at least 80% of the cases.


6 Conclusion 


In this work, we introduce an architecture of a Convolutional Denoising Autoen- 
coder (CDA) adapted for multivariate time series imputation which inflates the 
size of the hidden layers in the Encoder instead of reducing them. We also 
employ a preprocessing step that turns the time series into 2D images based 
on Gramian Angular Summation Fields in order to make the data more suitable 
for our CDA. We compare our method against a standard Multi Layer Percep- 
tron (MLP) and the state-of-the-art imputation method Multiple Imputations 
by Chained Equations (MICE) with three different types of regression (Linear, 
Ridge and Lasso). Our experiments were conducted on five different multivariate 
time series datasets, for which we simulated 20%, 30%, 40% and 50% missingness 
with data missing completely at random. Our results show that the CDA based 
imputation outperforms MICE on all five datasets and also beats the MLP on
four datasets. On the fifth dataset CDA and MLP perform very similarly, but
CDA is still better on three of the four degrees of missingness. Additionally we
present a preprocessing step on the datasets which manipulates the time series 
lengths to generate more training samples for our model which led to a better 
performance. The results show that the CDA model performs strongly against 
both linear and non-linear regression based imputation models. Deep Learning 
Networks are usually computationally more intensive than MICE but the impu- 
tation quality of CDA was convincing enough to be chosen over MICE or MLP 
based imputation. 

In the future we also plan to investigate other types of missing data apart
from Missing Completely At Random (MCAR) and want to incorporate more
datasets as well as other deep learning based approaches for imputation.
Abstract. Fraud detection is an important research area where machine
learning has a significant role to play. An important task in that context,
on which the quality of the results obtained depends, is feature engineering.
Unfortunately, this is very time-consuming and labor-intensive. Thus, in this
article, we present the DuSVAE model that consists of a generative model
that takes into account the sequential nature of the data. It combines two
variational autoencoders that can generate a condensed representation
of the input sequential data that can then be processed by a classifier to
label each new sequence as fraudulent or genuine. The experiments we
carried out on a large real-world dataset, from the Worldline company,
demonstrate the ability of our system to better detect frauds in credit
card transactions without any feature engineering effort.


Keywords: Anomaly detection · Fraud detection · Sequential data ·
Variational autoencoder


1 Introduction 


An anomaly (also called outlier, change, deviation, surprise, peculiarity, intru- 
sion, etc.) is a pattern, in a dataset, that does not conform to an expected behav- 
ior. Thus, anomaly detection is the process of finding anomalies in a dataset [4]. 
Fraud detection, a subdomain of anomaly detection, is a research area where the 
use of machine learning can have a significant financial impact for companies 
suffering from large frauds and it is not surprising that a very large amount of 
research has been conducted over many years in that field [1]. 

At the Worldline company, we process billions of electronic transactions per
year in our highly secured data centers. It is obvious that detecting frauds in
that context is a very difficult task. For many years, the detection of credit card
frauds within Worldline has been based on a set of rules manually designed by
experts. Nevertheless such rules are difficult to maintain, difficult to transfer to
other business lines, and dependent on experts who need a very long training
period. The contribution of machine learning in this context seems obvious and
Worldline decided several years ago to develop research in this field.

Firstly, Worldline has put a lot of effort into feature engineering [3,9,12] to
develop discriminative handcrafted features. This drastically improved the
supervised learning of classifiers that aim to label card transactions as genuine or
fraudulent. Nevertheless, designing such features requires a huge amount of time
and human resources, which is very costly. Thus developing automatic feature
engineering methods becomes a critical issue to improve the efficiency of our
models. However, in our industrial setting, we have to face many issues,
among which is the presence of highly imbalanced data where the fraud ratio is
about 0.3%. For this reason, we first focused on classic unsupervised approaches
in anomaly detection where the objective is to learn a model from normal
data and then isolate non-compliant samples and consider them as anomalies
[5,17,19,21,22].

In this context, the deep autoencoder [7] is considered a powerful data modeling
tool in the unsupervised setting. An autoencoder (AE) is made up of two
parts: an encoder designed to generate a compressed coding from the training
input data and a decoder that reconstructs the original input from the compressed
coding. In the context of anomaly detection [6,20,22], an autoencoder is
generally trained by minimizing the reconstruction error only on normal data.
Afterwards, the reconstruction error is used as an anomaly score. This assumes
that the reconstruction error for normal data should be small as it is close to
the training data, while the reconstruction error for abnormal data should
be high.

However, this assumption is not always valid. Indeed, it has been observed 
that sometimes the autoencoder generalizes so well that it can also reconstruct 
anomalies, so that some anomalies are viewed as normal data. This can also
be the case when some abnormal data share some characteristics of normal data
in the training set or when the decoder is so powerful that it properly decodes
abnormal codings. To solve the shortcomings of autoencoders, [13,18] proposed
the negative learning technique that aims to control the compressing capacity 
of an autoencoder by optimizing conflicting objectives of normal and abnormal 
data. Thus, this approach looks for a solution in the gradient direction for the 
desired normal input and in the opposite direction for the undesired input. 

This approach could be very appealing to deal with fraud detection prob- 
lems but we found that it is sometimes not sufficient in the context of our data. 
Indeed, it is generally almost impossible to obtain in advance a dataset contain- 
ing all representative frauds, especially in the context where unknown fraudulent 
transactions occur on new terminals or via new fraudulent behaviors. This has 
led us to consider more complex models based on variational autoencoders (VAE), a
probabilistic generative extension of the AE, able to model complex generative
distributions, which we found better suited to modeling new possible frauds.

Another important point for credit card fraud detection is the sequential 
aspect of the data. Indeed, to test a card for example, a fraudster may try to 
make several (small) transactions in a short time interval, or directly perform 
an abnormally high transaction with respect to existing transactions of the true
card holder. This sequential aspect has been addressed either indirectly,
via aggregated features [3] that we would like to avoid designing, or directly,
by sequential models such as LSTM; however, [9] report that the LSTM
did not improve the detection performance much for e-commerce transactions.
One of the main contributions of this paper is to propose a method to identify
fraudulent sequences of credit transactions in the context of highly imbalanced 
data. For this purpose, we propose a model called DuSVAE, for Dual Sequen- 
tial Variational Autoencoders, that consists of a combination of two variational 
autoencoders. The first one is trained from fraudulent sequences of transactions 
in order to be able to project the input data into another feature space and to 
assign a fraud score to each sequence thanks to the reconstruction error informa- 
tion. Once this model is trained, we plug a second VAE at the output of the first 
one. This second VAE is then trained with a negative learning approach with 
the objective to maximize the reconstruction error of the fraudulent sequences 
and minimize the reconstruction error of the genuine ones. 

Our method has been evaluated on a Worldline dataset for credit card fraud
detection. The obtained results show that DuSVAE can extract hidden represen- 
tations able to provide results close to those obtained after a significant work of 
feature engineering, therefore saving time and human effort. It is even possible 
to improve the results when combining engineered features with DuSVAE. 

The article is organized as follows: some preliminaries about the techniques 
used in this work are given in Sect. 2. Then we describe the architecture and the 
training strategy of the DuSVAE method in Sect. 3. Experiments are presented
in Sect. 4 after a presentation of the dataset and useful metrics. Finally Sect. 5 
concludes this article. 


2 Preliminaries 


In this section, we briefly describe the main techniques that are used in DuSVAE: 
vanilla and variational autoencoders, negative learning and mixture of experts. 


2.1 Autoencoder (AE) 


An AE is a neural network [7], which is optimized in an unsupervised manner, 
usually used to reduce the dimensionality of the input data. It is made up of 
two parts linked together: an encoder E(x) and a decoder D(z). Given an input 
sample x, the encoder generates z, a condensed representation of x. The decoder 
is then tuned to reconstruct the original input x from the encoded representation 
z. The objective function used during the training of the AE is given by: 


$$\mathcal{L}_{AE}(x) = \| x - D(E(x)) \| \tag{1}$$


where $\|\cdot\|$ denotes an arbitrary distance function. The $\ell_2$ norm is typically
applied here. The AE can be optimized, for example, using stochastic gradient
descent (SGD) [10].
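
As a small illustration of Eq. 1 used as an anomaly score, the sketch below computes the l2 reconstruction error for a few points. The names `l2_reconstruction_error`, `anomaly_scores` and `toy_autoencoder` are hypothetical, and the toy "autoencoder" is only a stand-in for D(E(x)) (it projects 2-D points onto the line where both coordinates are equal), not a trained network.

```python
import math


def l2_reconstruction_error(x, x_hat):
    """Euclidean (l2) distance between an input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


def anomaly_scores(inputs, reconstruct):
    """Score every input by how badly `reconstruct` (a stand-in for D(E(x))) restores it."""
    return [l2_reconstruction_error(x, reconstruct(x)) for x in inputs]


def toy_autoencoder(x):
    """Assumption for illustration: 'normal' 2-D points lie on the line x0 == x1,
    so the toy model reconstructs any point by projecting it onto that line."""
    mean = (x[0] + x[1]) / 2.0
    return [mean, mean]


points = [[1.0, 1.0], [3.0, -1.0]]
scores = anomaly_scores(points, toy_autoencoder)
# The first point lies on the 'normal' line and gets a zero score;
# the second one is far from it and gets a larger score.
```

A point is then flagged as anomalous when its score exceeds a threshold chosen on validation data.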




2.2 Variational Autoencoder (VAE) 


A VAE [11,16] is an attractive probabilistic generative version of the standard
autoencoder. It can learn a complex distribution and then use it as a generative
model defined by a prior $p(z)$ and a conditional distribution $p_\theta(x|z)$. Because
the true likelihood of the data is generally intractable, a VAE is trained
by maximizing the evidence lower bound (ELBO):


$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z)) \tag{2}$$


where the first term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is a negative reconstruction loss that
encourages $q_\phi(z|x)$ (the encoder) to generate a meaningful latent vector $z$, so that
$p_\theta(x|z)$ (the decoder) can reconstruct the input $x$ from $z$. The second term
$D_{KL}(q_\phi(z|x) \,\|\, p(z))$ is a KL regularization loss that minimizes the KL divergence
between the approximate posterior $q_\phi(z|x)$ and the prior $p(z) = \mathcal{N}(0, I)$.
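
When the approximate posterior is a diagonal Gaussian with mean `mu` and standard deviation `sigma`, the KL term against the standard normal prior has a well-known closed form, 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2). The sketch below evaluates it and combines it with a reconstruction loss; the function names are hypothetical, and the diagonal-Gaussian posterior is an assumption made for the example.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian posterior."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    if any(s <= 0 for s in sigma):
        raise ValueError("standard deviations must be positive")
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


def negative_elbo(reconstruction_loss, mu, sigma):
    """Quantity minimized in practice: reconstruction loss plus the KL penalty."""
    return reconstruction_loss + kl_to_standard_normal(mu, sigma)


# The penalty vanishes when the posterior equals the prior and grows as it drifts away.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_to_standard_normal([1.0], [1.0])
```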


2.3 Negative Learning 


Negative learning is a technique used for regularizing the training of the AE in 
the presence of labelled data by limiting reconstruction capability (LRC) [13]. 
The basic idea is to maximize the reconstruction error for abnormal instances, 
while minimizing the reconstruction error for normal ones in order to improve the 
discriminative ability of the AE. Let $x \in \mathbb{R}^n$ be an input instance and $y \in \{0,1\}$
its associated label, where $y = 1$ stands for a fraudulent instance and
$y = 0$ for a genuine one. The objective function of LRC to be minimized is:


$$(1 - y)\,\mathcal{L}_{AE}(x) - y\,\mathcal{L}_{AE}(x) \tag{3}$$


Training LRC-based models has the major disadvantage of being generally
unstable, because the anomaly reconstruction error is not upper
bounded. The LRC approach then tends to maximize the reconstruction error
for known anomalies rather than minimize the reconstruction error for normal
points, leading to a bad reconstruction of normal data points. To overcome this
problem, [18] proposed Autoencoding Binary Classifiers (ABC) for supervised
anomaly detection, which improve LRC by using an objective function based
on a bounded reconstruction loss for anomalies, leading to better training
stability. The objective function of the ABC to be minimized is:


$$(1 - y)\,\mathcal{L}_{AE}(x) - y \log_2(1 - e^{-\mathcal{L}_{AE}(x)}) \tag{4}$$
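
To see why the bounded term helps, the two objectives can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, the reconstruction error is passed in as a plain positive number, and the base-2 logarithm is an assumption (any base gives the same qualitative behaviour).

```python
import math


def lrc_loss(recon_error, y):
    """LRC objective: minimize the error for y = 0, maximize it for y = 1."""
    return (1 - y) * recon_error - y * recon_error


def abc_loss(recon_error, y):
    """ABC objective: same term for y = 0, bounded log term for y = 1.

    Requires recon_error > 0 when y = 1, otherwise the logarithm is undefined.
    """
    if y == 0:
        return recon_error
    return -math.log2(1.0 - math.exp(-recon_error))


# For an anomaly (y = 1) the LRC loss keeps decreasing without bound as the
# reconstruction error grows, while the ABC loss flattens out at zero.
lrc_values = [lrc_loss(e, 1) for e in (1.0, 10.0, 100.0)]
abc_values = [abc_loss(e, 1) for e in (1.0, 10.0, 100.0)]
```

Because the anomaly term of the ABC loss can never fall below zero, pushing anomalies further away eventually stops paying off, and the optimizer is free to keep reducing the error on normal points.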


2.4 Mixture-of-Experts Layer (MoE) 


In addition to the previous methods, we now present the notion of MoE layer [8] 
that will be used in our model. 

The MoE layer aims to combine the outputs of a group of $n$ neural networks
called experts $EX_1, EX_2, \ldots, EX_n$. The experts have their own parameters
but work on the same input; their $n$ outputs are combined linearly with the
outputs of the gating network $G$, which weights the experts according to the
input $x$ (see Fig. 1 for an illustration). Let $EX_i(x)$ be the output of expert $EX_i$
and $G(x)_i$ be the $i$-th component of $G(x)$; the output $y$ of the MoE is then defined
as follows:


$$y = \sum_{i=1}^{n} G(x)_i \, EX_i(x). \tag{5}$$


Fig. 1. An illustration of the MoE layer architecture


The intuition behind MoE layers is to train different network experts that 
can focus on specific peculiarities of the data and then choose an appropriate 
combination of experts with respect to the input x. In our industrial context, 
such a layer would help us to take into account different behaviors from millions 
of cardholders, which results in a variety of data distributions. The different
expert networks can thus model various behaviors observed in the dataset and
be combined adequately as a function of the input data.
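
The weighted combination of Eq. 5 can be sketched in a few lines. Everything below is a toy: `moe_output`, the two constant experts and the gate are hypothetical, and using a softmax to turn gate scores into weights is an assumption, since the text only states that the gating network weights the experts according to the input.

```python
import math


def softmax(scores):
    """Numerically stable softmax turning raw gate scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_output(x, experts, gate):
    """Combine expert outputs linearly with input-dependent gate weights (Eq. 5)."""
    weights = softmax(gate(x))
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, outputs)) for k in range(dim)]


# Two toy experts with fixed outputs and a gate that scores them equally.
experts = [lambda x: [1.0, 0.0], lambda x: [0.0, 1.0]]
uniform_gate = lambda x: [0.0, 0.0]
y = moe_output([0.3, 0.7], experts, uniform_gate)

# A gate that strongly prefers the first expert pulls the output towards it.
biased_gate = lambda x: [10.0, 0.0]
y_biased = moe_output([0.3, 0.7], experts, biased_gate)
```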


3 The DuSVAE Model 


In this section, we present our approach to extract a hidden representation of 
input sequences to be used for anomaly/fraud detection. We first introduce the
model architecture with the loss functions used, then we describe the learning 
procedure used to train the model. 


3.1 Model Architecture 


We assume in the following that we are given as input a set of sequences
$\mathcal{X} = \{x \mid x = (t^1, t^2, \ldots, t^m) \text{ with } t^i \in \mathbb{R}^d\}$, every sequence being composed of $m$
transactions encoded by numerical vectors. Each sequence is associated with a
label $y \in \{0,1\}$ such that $y = 1$ indicates a fraudulent sequence and $y = 0$ a
genuine one. We label a sequence as fraudulent if its last transaction is a fraud.

As illustrated in Fig. 2, our approach consists of two sequential variational
autoencoders. The first one is trained only on fraudulent sequences of the training
data. We use the generative capacity of this autoencoder to generate diverse and
representative fraudulent instances with respect to the sequences
given as input. The purpose of this autoencoder is to prepare the data for the
second autoencoder and also to provide a first anomaly/fraud score through the
reconstruction error.

Fig. 2. The DuSVAE model architecture

The first layers of the autoencoders are bi-directional GRU layers allowing us to 
handle sequential data. The remaining parts of the encoder and the decoder contain 
GRU and fully connected (FC) layers, as shown in Fig. 2. The loss function used 
to optimize the reconstruction error of the first autoencoder is defined as follows: 


$$\mathcal{L}_{rec}(x, \phi_1, \theta_1) = \mathrm{mse}(x, D_{\theta_1}(E_{\phi_1}(x))) + D_{KL}(q_{\phi_1}(z|x) \,\|\, p(z)), \tag{6}$$


where mse is the mean squared error function and $p(z) = \mathcal{N}(0, I)$. The encoder
$E_{\phi_1}(x)$ generates a latent representation $z$ according to $q_{\phi_1}(z|x) = \mathcal{N}(\mu_1, \sigma_1)$.
The decoder $D_{\theta_1}$ tries to reconstruct the input sequence from $z$. In order to
avoid mode collapse between the reconstructed transactions of the sequence,
we add the following loss function to control the reconstruction of individual
transactions with respect to relative distances from an input sequence $x$:


$$\mathcal{L}_{trx}(x, \phi_1, \theta_1) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \frac{1}{d} \left\| \mathrm{abs}(t^i - t^j) - \mathrm{abs}(\bar{t}^i - \bar{t}^j) \right\|_1 \tag{7}$$


where $\bar{t}^i$ is the reconstruction obtained by the AE for the $i$-th transaction of the
sequence and $\mathrm{abs}(t)$ returns a vector whose features are the absolute values
of the original input vector $t$.

So, we train the parameters $(\phi_1, \theta_1)$ of the first autoencoder by minimizing the
following loss function over all the fraudulent sequences of the training samples:


$$\mathcal{L}_1(x, \phi_1, \theta_1) = \mathcal{L}_{rec}(x, \phi_1, \theta_1) + \lambda\, \mathcal{L}_{trx}(x, \phi_1, \theta_1), \tag{8}$$


where $\lambda$ is a tradeoff parameter.

The second autoencoder is then trained over all the training sequences by
negative learning. It takes as input both a sequence $x$ and its reconstructed
version from the first autoencoder, $AE_1(x)$, which corresponds to the output of its
last layer. The loss function considered to optimize the parameters $(\phi_2, \theta_2)$ of
the second autoencoder is then defined as follows:

$$\mathcal{L}_2(x, AE_1(x), \phi_2, \theta_2) = (1 - y)\,\mathcal{L}_1(x, \phi_2, \theta_2) - y\,(\bar{\mathcal{L}}_1(x, \phi_1, \theta_1) + \epsilon)\,\log_2(1 - e^{-\mathcal{L}_1(x, \phi_2, \theta_2)}), \tag{9}$$


where $\bar{\mathcal{L}}_1(x, \phi_1, \theta_1)$ denotes the reconstruction loss $\mathcal{L}_1$ rescaled to the $[0, 1]$
interval with respect to all fraudulent sequences and $\epsilon$ is a small value used
to smooth very low anomaly scores. The architecture of this second autoencoder
is similar to that of the first one, except that we use a MoE layer to compute the
mean of the normal distribution $\mathcal{N}(\mu_2, \sigma_2)$ defined by the encoder. As said
previously, the objective is to take into account the variety of the different behavior
patterns found in our genuine data. The experts used in that layer are simple
one-layer feed-forward neural networks.


3.2 The Training Strategy 


The global learning algorithm is presented in Algorithm 1. It has two training
phases. The first one focuses on training the first autoencoder $AE_1$ as a backing
model for the second phase. It is trained only on fraudulent sequences by
minimizing Eq. 8. Once the model has converged, we freeze its weights and start the
second phase. For training the second autoencoder $AE_2$, we use both genuine
and fraudulent sequences and their reconstructed versions given by $AE_1$. We
then optimize the weights of $AE_2$ by minimizing Eq. 9. To control the imbalance
ratio, the training is done at each iteration by sampling $n$ examples from fraudulent
sequences and $n$ from genuine sequences. We repeat this step iteratively,
increasing the number $n$ of sampled examples at each new iteration, until
the model converges.


Algorithm 1. Dual sequential variational autoencoder (DuSVAE)

Input: $X_g$ genuine data, $X_f$ fraudulent data.
Parameters: $n$ number of sampled examples; $h$ increment step.
Output: $AE_1$ autoencoder, $AE_2$ autoencoder.
repeat
    Train $AE_1$ on $X_f$ by minimizing Equation 8
until convergence
Freeze the weights of $AE_1$
repeat
    $X_1 \leftarrow$ Sample($X_f$, $n$) $\cup$ Sample($X_g$, $n$)
    $X_2 \leftarrow AE_1(X_1)$
    Train $AE_2$ on ($X_1$, $X_2$) by minimizing Equation 9
    if $n < |X_f|$ then
        $n \leftarrow n + h$
    end if
until convergence
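
The balanced sampling of the second phase can be sketched independently of any neural network. In the sketch below, `balanced_batches` is a hypothetical helper, the "until convergence" test is replaced by a fixed number of rounds (an assumption made only to keep the example finite), and the sample size is capped by the size of the smaller set.

```python
import random


def balanced_batches(genuine, fraud, n, h, rounds, seed=0):
    """Yield (n, batch) pairs, where each batch holds as many fraudulent as genuine
    examples and n grows by h while it is still smaller than the fraud set."""
    rng = random.Random(seed)
    for _ in range(rounds):
        k = min(n, len(fraud), len(genuine))  # cannot sample more than is available
        batch = [(x, 1) for x in rng.sample(fraud, k)]
        batch += [(x, 0) for x in rng.sample(genuine, k)]
        yield n, batch
        if n < len(fraud):
            n += h


# Toy data: 100 genuine and 10 fraudulent items, identified by integers.
genuine_items = list(range(100))
fraud_items = list(range(1000, 1010))
schedule = []
for n, batch in balanced_batches(genuine_items, fraud_items, n=4, h=3, rounds=4):
    labels = [label for _, label in batch]
    schedule.append((n, labels.count(1), labels.count(0)))
```

Each batch would then be passed through the frozen first autoencoder and used for one training step of the second one.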
Table 1. Properties of the Worldline dataset used in the experiments.


|                 | Train (01/01-21/03) | Validation (22/03-31/03) | Test (01/04-30/04) |
|-----------------|---------------------|--------------------------|--------------------|
| # of genuine    | 25,120,194          | 3,019,078                | 9,287,673          |
| # of fraud      | 88,878              | 9,631                    | 29,614             |
| Total           | 25,209,072          | 3,028,709                | 9,317,287          |
| Imbalance ratio | 0.003526            | 0.00318                  | 0.003178           |


4 Experiments 


In this section, we provide an experimental evaluation of our approach on a 
real-world dataset of credit card e-payment transactions provided by Worldline. 
First, we present the dataset, then we present the metrics used to evaluate the 
models learned by our system and finally, we compare DuSVAE with other state- 
of-the-art approaches. 


4.1 Dataset 


The dataset provided by Worldline covers 4 months of credit card e-payment
transactions made by European cardholders in e-commerce mode. It has been
split into Train, Validation and Test sets, used respectively to train, tune
and test the learned models. Its main challenges have been studied in [2], one of
them being the imbalance ratio, as we can see in Table 1, which presents the main
characteristics of this dataset.

Each transaction is described by 12 features. A Boolean value is assigned 
to each transaction to specify whether it corresponds to a fraud or not. This 
labeling is handled by a team of human experts. 

Since most features have a large number of values, using brute one-hot encod- 
ing would generate a huge number of features. For example the “Merchant Cate- 
gory Code” feature has 283 possible values and one-hot encoding would produce 
283 new features. That would make our approach inefficient. Thus, before using 
one-hot encoding, we replace each categorical value of each feature with a score
that reflects its risk of being associated with a fraudulent transaction. Consider,
for example, a categorical feature $f$. We can compute the probability of the $j$-th
value of feature $f$ being associated with a fraudulent transaction, denoted as
$\beta_j$, as follows: $\beta_j = \frac{N^{+}_{f=j}}{N_{f=j}}$, where $N^{+}_{f=j}$ is the number of fraudulent transactions
where the value of feature $f$ is equal to $j$ and $N_{f=j}$ is the total number of
transactions where the value of feature $f$ is equal to $j$. In order to take into account
the number of transactions related to a particular value of a given feature, we
follow [14]. For each value $j$ of a given feature, the fraud score $S_j$ for this value
is defined as follows:

$$S_j = \alpha'_j \beta_j + (1 - \alpha'_j)\, AFP \tag{10}$$


This score computes a weighted combination of $\beta_j$ and the probability of having a fraud
in a day (Average Fraud Probability: AFP). The weight $\alpha'_j$ is a normalized value
of $\alpha_j$ in the range [0, 1], where $\alpha_j$ is defined as the proportion of
transactions for that value out of the total number $N$ of transactions: $\alpha_j = \frac{N_{f=j}}{N}$.


Having replaced each value for each feature by its score, we can then run 
one-hot encoding and thus significantly reduce the number of features generated. 
For example, the “Merchant Category Code” feature has 283 possible values and 
instead of generating 283 features, this technique produces only 29 features. 
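
The score of Eq. 10 can be computed with simple counting. The sketch below is a toy under stated assumptions: `fraud_scores` is a hypothetical helper, the weights are normalized by min-max scaling (the text only says they are normalized to [0, 1]), and the overall fraud rate of the toy data stands in for the daily Average Fraud Probability.

```python
from collections import Counter


def fraud_scores(values, labels):
    """Map each categorical value to a risk score S_j = a'_j * beta_j + (1 - a'_j) * AFP."""
    n_total = len(values)
    afp = sum(labels) / n_total  # assumption: overall fraud rate used as AFP
    count = Counter(values)
    fraud_count = Counter(v for v, y in zip(values, labels) if y == 1)
    alpha = {v: count[v] / n_total for v in count}
    lo, hi = min(alpha.values()), max(alpha.values())
    scores = {}
    for v in count:
        beta = fraud_count[v] / count[v]  # fraud probability for this value
        # assumption: min-max normalization of the frequency weight
        weight = (alpha[v] - lo) / (hi - lo) if hi > lo else 1.0
        scores[v] = weight * beta + (1.0 - weight) * afp
    return scores


# Toy feature: value "a" is frequent (6 transactions, 1 fraud),
# value "b" is rare (2 transactions, 1 fraud).
values = ["a"] * 6 + ["b"] * 2
labels = [1, 0, 0, 0, 0, 0, 1, 0]
scores = fraud_scores(values, labels)
```

The frequent value keeps its own fraud rate, while the rare one is pulled towards the global rate, which is the smoothing effect the weighting is meant to achieve.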

Finally, to generate sequences from transactions, we grouped all the transac- 
tions by cardholder ID and we ordered each cardholder’s transactions by time. 
Then, with a sliding window over the transactions we obtained a time-ordered 
sequence of transactions for each cardholder. For each sequence, we have assigned 
the label fraudulent or genuine of its last transaction. 
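
A minimal sketch of this sequence construction follows. The tuple layout `(card_id, timestamp, features, is_fraud)` and the helper `build_sequences` are hypothetical, and cardholders with fewer than `m` transactions are simply skipped here, which is an assumption (padding would be another option).

```python
from collections import defaultdict


def build_sequences(transactions, m):
    """Group transactions by cardholder, sort them by time and slide a window of
    length m; each sequence takes the label of its last transaction."""
    by_card = defaultdict(list)
    for card_id, timestamp, features, is_fraud in transactions:
        by_card[card_id].append((timestamp, features, is_fraud))

    sequences = []
    for history in by_card.values():
        history.sort(key=lambda tx: tx[0])  # time order within one cardholder
        for end in range(m, len(history) + 1):
            window = history[end - m:end]
            sequence = [features for _, features, _ in window]
            sequences.append((sequence, window[-1][2]))
    return sequences


toy_transactions = [
    ("A", 3, [0.3], 1),
    ("A", 1, [0.1], 0),
    ("A", 2, [0.2], 0),
    ("B", 1, [0.9], 0),  # only one transaction: no window of length 2
]
sequences = build_sequences(toy_transactions, m=2)
```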


4.2 Metrics 


In the context of fraud detection, fortunately, the number of fraudulent transac- 
tions is significantly lower than the number of normal transactions. This leads to 
a very imbalanced dataset. In this situation, the traditional performance mea- 
sures are not appropriate. Indeed, with an overall fraud rate of 0.3%, classifying 
each transaction as normal leads to an accuracy of 99.7%, despite the fact that 
the model is absolutely naive. That means we have to choose appropriate per- 
formance measures that are robust in the case of imbalanced data. In this work 
we rely on the area under the precision-recall curve (AUC-PR) as a robust and 
clear measure of the accuracy of the classifier in an imbalanced setting. Each 
point of the precision-recall curve corresponds to the precision of the classifier 
at a specific recall level. 
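
For illustration, the area under the precision-recall curve can be approximated by a step-wise sum over the ranked predictions (often called average precision). The helpers below are hypothetical and ignore tied scores; other interpolation schemes give slightly different values, so this is one possible estimator rather than the one used in the experiments.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) after each prediction, from highest to lowest score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    points, true_positives = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        points.append((true_positives / total_positives, true_positives / rank))
    return points


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in precision_recall_points(scores, labels):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
ap = average_precision(toy_scores, toy_labels)
```

Unlike accuracy, this value drops sharply when frauds are ranked below genuine transactions, even if frauds are very rare.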

Once an alert is raised after a fraud has been detected, fraud experts can contact
the card-holder to check the validity of suspicious transactions. So, within
a single day, the fraud experts have to check a large number of transactions
provided by the fraud detection system. Consequently, the precision of the
transactions highlighted as fraud is an important metric because it can help human
experts at Worldline to focus on the most important frauds and leave aside minor
frauds due to lack of time to process them. For this purpose, we rely on the P@K
as a global metric to compare models. It is the average of the precision of the
first K transactions, calculated according to the following equation:


$$\text{AverageP@K} = \frac{1}{K} \sum_{i=1}^{K} P@i \tag{11}$$
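
Read literally, Eq. 11 averages the precision measured at every cut-off from 1 to K in the ranked list of alerts. A small sketch under that reading follows; the function names are hypothetical and ties between scores are not handled specially.

```python
def precision_at(k, ranked_labels):
    """Fraction of true frauds among the k highest-ranked transactions."""
    return sum(ranked_labels[:k]) / k


def average_precision_at_k(scores, labels, K):
    """Average of P@1, P@2, ..., P@K over the transactions ranked by decreasing score."""
    ranked_labels = [
        label
        for _, label in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ]
    if not 1 <= K <= len(ranked_labels):
        raise ValueError("K must be between 1 and the number of transactions")
    return sum(precision_at(i, ranked_labels) for i in range(1, K + 1)) / K


alert_scores = [0.9, 0.8, 0.7, 0.6]
alert_labels = [1, 0, 1, 0]
avg_p_at_2 = average_precision_at_k(alert_scores, alert_labels, K=2)
```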


4.3 Comparison with the State of the Art 


We compare our approach with the following methods: variational autoencoder
[11,16] trained on fraudulent or genuine data only (VAE(F) or VAE(G)
respectively); limiting reconstruction capability (LRC) [13] and autoencoding binary
classifiers for supervised anomaly detection (ABC) [18]. It is important to note
that ABC and LRC are not sequential models by nature. So, to make our
comparison fairer, we adapted their implementation to allow them to process
sequential data. As a classifier, we used CatBoost [15], which is robust in the
context of imbalanced data and efficient on GPUs.

Table 2. AUC-PR achieved by CatBoost using various autoencoder models


| Models  | Raw Trx | Raw Seq | Reconstructed Trx | Reconstructed Seq | Reconstruction error | Code1 | Code2 |
|---------|---------|---------|-------------------|-------------------|----------------------|-------|-------|
| VAE (F) | 0.19    | 0.40    | 0.36              | 0.38              | 0.29                 | 0.30  | 0.27  |
| VAE (G) |         |         | 0.42              | 0.43              | 0.31                 | 0.32  | 0.33  |
| LRC     |         |         | 0.46              | 0.46              | 0.17                 | 0.28  | 0.13  |
| ABC     |         |         | 0.48              | 0.50              | 0.37                 | 0.32  | 0.3   |
| DuSVAE  |         |         | 0.51              | 0.53              | 0.36                 | 0.50  | 0.49  |

First, as we can observe in Table 2, the AUC-PR values obtained by running 
CatBoost directly on transactions and sequences of transactions are respectively 
equal to 0.19 and 0.40. If we look at the AUC-PR values obtained by running 
CatBoost on the reconstructed transactions and sequences of transactions, we 
can observe that the results are always greater than those obtained by running 
CatBoost on raw data. Moreover it is interesting to note that DuSVAE achieved 
the best results (0.51 and 0.53) compared to other state-of-the-art systems. 

Now, if we look at the performance obtained by CatBoost on the hidden
representation vectors Code1 and Code2, we observe that DuSVAE outperforms
the other state-of-the-art systems and that its results are quite
similar to the ones obtained on the reconstructed sequences of transactions. This
is interesting because it means that DuSVAE yields a condensed representation
of the input data that gives approximately the same
results as the reconstructed sequences of transactions, which are of higher
dimensionality (about 10 times more) and can be processed less efficiently by
the classifier. Finally, when using the reconstruction error as a score to classify
fraudulent data, as is usually done in anomaly detection, we can observe that
DuSVAE is competitive with the best method. However, the significantly better
performance of Code1 and Code2 with CatBoost makes the use of
the hidden representations a better strategy than using the reconstruction error.

We then evaluated the impact of handcrafted features built by Worldline on
the classifier performance. As we can see in the first two rows of Table 3, adding
handcrafted features to the original sequential raw dataset leads to much better
results both from the point of view of the AUC-PR measure and the P@K measure.

Now if we consider using DuSVAE (rows 3 and 4 of Table 3), we can also
notice a significant improvement of the results obtained on the raw dataset of
sequences augmented by handcrafted features compared to the results obtained
on the original one without these additional features. This is observed for both
the AUC-PR measure and the P@K measure. We see that, for the moment,
by using a classifier on the sequences reconstructed by DuSVAE on just the
raw dataset (AUC-PR = 0.53), we cannot reach the results obtained when we
use this classifier on the raw dataset augmented by handcrafted features
(AUC-PR = 0.60). This can be explained by the fact that those features are based
on history and profiling techniques that embed information covering a period
of time larger than the one used for our dataset. Nevertheless, we are not so
far off, and the fact that using DuSVAE on the dataset augmented by handcrafted
features (AUC-PR = 0.65) leads to better results than using the classifier without
DuSVAE (AUC-PR = 0.60) is promising.

Table 3. AUC-PR and P@K achieved by CatBoost for sequence classification.


| Input                                            | AUC-PR | P@100 | P@500 |
|--------------------------------------------------|--------|-------|-------|
| Raw data                                         | 0.40   | 0.43  | 0.11  |
| Raw data + Handcrafted features                  | 0.60   | 0.62  | 0.938 |
| DuSVAE (input: raw data)                         | 0.53   | 0.88  | 0.72  |
| DuSVAE (input: raw data + Handcrafted features)  | 0.65   | 0.85  | 0.941 |

Table 3 also shows very good P@K values when running the
classifier on the sequences of transactions reconstructed by DuSVAE, which means that
DuSVAE can be a very significant help for experts to focus on real fraudulent
transactions and not waste time on false alerts.


5 Conclusion 


In this paper, we presented the DuSVAE model, which is a new fraud detection
technique. Our model combines two sequential variational autoencoders to
produce a condensed representation vector of the input sequential data that can
then be used by a classifier to label new sequences of transactions as genuine or
fraudulent. Our experiments have shown that the DuSVAE model produces much
better results, in terms of AUC-PR and P@K measures, than state-of-the-art
systems. Moreover, the DuSVAE model produces a condensed representation of the
input data which can very favorably replace the handcrafted features. Indeed,
running a classifier on the condensed representation of the input data built by
the DuSVAE model outperforms the results obtained on the raw data,
with or without handcrafted features.

We believe that a first interesting way to further improve our results will be 
to focus on attention mechanisms to better take into account the history of past 
transactions in the detection of present frauds. A second approach will be to 
better take into account the temporal aspects in the sequential representation 
of our data and to reflect it in the core algorithm. 
Abstract. Graph neural networks (GNNs) have enjoyed increasing
success recently, with many GNN variants achieving state-of-the-art
results on node and graph classification tasks. The proposed GNNs, 
however, often implement complex node and graph embedding schemes, 
which makes it challenging to explain their performance. In this paper, 
we investigate the link between a GNN’s expressiveness, that is, its abil- 
ity to map different graphs to different representations, and its gener- 
alization performance in a graph classification setting. In particular, we 
propose a principled experimental procedure where we (i) define a prac- 
tical measure for expressiveness, (ii) introduce an expressiveness-based 
loss function that we use to train a simple yet practical GNN that is 
permutation-invariant, (iii) illustrate our procedure on benchmark graph 
classification problems and on an original real-world application. Our 
results reveal that expressiveness alone does not guarantee a better per- 
formance, and that a powerful GNN should be able to produce graph 
representations that are well separated with respect to the class of the 
corresponding graphs. 


Keywords: Graph neural networks · Classification · Expressiveness


1 Introduction 


Many real-world data present an inherent structure and can be modelled as 
sequences, graphs, or hypergraphs [2,5,9,15]. Graph-structured data, in partic- 
ular, are very common in practice and are at the heart of this work. 

We consider the problem of graph classification. That is, given a set
$\mathcal{G} = \{G_i\}_{i=1}^{m}$ of arbitrary graphs and their respective labels $\{y_i\}_{i=1}^{m}$, where
$y_i \in \{1, \ldots, C\}$ and $C$ is the number of classes, we aim at finding a mapping
$f_\theta: \mathcal{G} \to \{1, \ldots, C\}$ that minimizes the classification error, where $\theta$ denotes the
parameters to optimize.

Graph neural networks (GNNs) and their deep learning variants, the graph 
convolutional networks (GCNs) [1,7,9,10,13,17,20,27], have gained consider- 
able interest recently. GNNs learn latent node representations by recursively 
aggregating the neighboring node features for each node, thereby capturing the 
structural information of a node’s neighborhood. 

Despite the profusion of GNN variants, some of which achieve state-of-the-art 
results on tasks like node classification, graph classification, and link prediction, 
GNNs remain very little studied. In particular, it is often unclear what a GNN 
learns and how the learned graph (or node) mapping influences its generalization 
performance. In a recent work, [25] present a theoretical framework to analyze 
the expressive power of GNNs, where a GNN’s expressiveness is defined as its 
ability to compute different graph representations for different graphs. Theoreti- 
cal conditions under which a GNN is maximally expressive are derived. Although 
it is reasonable to assume that a higher expressiveness would result in a higher 
accuracy on classification tasks, this link has not been explicitly studied so far. 

In this paper, we design a principled experimental procedure to analyze the 
link between expressiveness and the test accuracy of GNNs. In particular: 


— We define a practical measure to estimate the expressiveness of GNNs; 
— We use this measure to define a new penalized loss function that allows train- 
ing GNNs with varying expressive power. 


To illustrate our experimental framework, we introduce a simple yet practical 
architecture, the Simple Permutation-Invariant Graph Convolutional Network 
(SPI-GCN). We also present an original graph data set of metal hydrides that 
we use along with benchmark graph data sets to evaluate SPI-GCN. 

This paper is organized as follows. Section 2 discusses the related work.
Section 3 introduces preliminary notations and concepts related to graphs and 
GNNs. In Sect. 4, we introduce our graph neural network, SPI-GCN. In Sect. 5, 
we present a practical expressiveness estimator and a new expressiveness-based 
loss function as part of our experimental framework. Section 6 presents our
results and Sect. 7 concludes the paper. 


2 Related Work 


Graph neural networks (GNNs) were first introduced in [11,19]. They learn latent 
node representations by iteratively aggregating neighborhood information for 
each node. Their more recent deep learning variants, the graph convolutional 
networks (GCNs), generalize conventional convolutional neural networks to irreg- 
ular graph domains. In [13], the authors present a GCN for node classification 
where the computed node representations can be interpreted as the graph col- 
oring returned by the 1-dimensional Weisfeiler-Lehman (WL) algorithm [24]. A 
related GCN that is invariant to node permutation is presented in [27]. The graph 
convolution operator is closely related to the one in [13], and the authors introduce
a permutation-invariant pooling operator that sorts the convolved nodes
before feeding them to a 1-dimensional classical convolution layer for graph-level 
classification. A popular GCN is PATCHY-SAN [17]. Its graph convolution oper- 
ator extracts normalized local “patches” (neighborhood representations) of the 
graph which are then sorted and fed to a 1-dimensional traditional convolution 
layer for graph-level classification. The method, however, requires the definition 
of a node ordering and running the WL algorithm in a preprocessing step. On 
the other hand, the normalization of the extracted patches implies sorting the 
nodes again and using the external graph software Nauty [14]. 

Despite the success of GNNs, there are relatively few papers that analyze 
their properties, either mathematically or empirically. A notable exception is the 
recent work by [25] that studies the expressive power of GNNs. The authors prove 
that (i) GNNs are at most as powerful as the WL test in distinguishing graph 
structures and that (ii) if the graph function of a GNN—i.e. its graph embedding 
scheme—is injective, then the GNN is as powerful as the WL test. The authors 
also present the Graph Isomorphism Network (GIN), which approximates the 
theoretical maximally expressive GNN. In another study [4], the authors present 
a simple neural network defined on a set of graph augmented features and show 
that their architecture can be obtained by linearizing graph convolutions in 
GNNs. 

Our work is related to [25] in that we adopt the same definition of expres- 
siveness, that is, the ability of a GNN to compute distinct graph representations 
for distinct input graphs. However, we go one step further and investigate how 
the graph function learned by GNNs affects their generalization performance. 
On the other hand, our SPI-GCN extends the GCN in [13] to graph-level clas- 
sification. Our SPI-GCN is also related to [27] in that we use a similar graph 
convolution operator inspired by [13]. Unlike [27], however, our architecture does 
not require any node ordering, and we only use a simple multilayer perceptron 
(MLP) to perform classification. 


3 Some Graph Concepts 


A graph $G$ is a pair $(V, E)$ of a set $V = \{v_1, \ldots, v_n\}$ of vertices (or nodes) $v_i$, and
a set $E \subseteq V \times V$ of edges $(v_i, v_j)$. In this work, we represent a graph $G$ by two
matrices: (i) an adjacency matrix $A \in \mathbb{R}^{n \times n}$ such that $a_{ij} = 1$ if there is an edge
between nodes $v_i$ and $v_j$ and $a_{ij} = 0$ otherwise,¹ and (ii) a node feature matrix
$X \in \mathbb{R}^{n \times d}$, with $d$ being the number of node features. Each row $x_i \in \mathbb{R}^d$ of $X$
contains the feature representation of a node $v_i$, where $d$ is the dimension of the
feature space. Since we only consider node features in this paper (as opposed to 
edge features for instance), we will refer to the node feature matrix X simply as 
the feature matrix in the rest of this paper. 


¹ Given a matrix M, $m_i$ denotes its ith row and $m_{ij}$ denotes the entry at its ith row
and jth column. More generally, we denote matrices by capital letters and vectors
by small letters. Scalars, on the other hand, are denoted by small italic letters.

An important notion in graph theory is graph isomorphism. Two graphs
$G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if there exists a bijection
$g : V_1 \to V_2$ such that every edge (u, v) is in $E_1$ if and only if the edge (g(u), g(v))
is in $E_2$. Informally, this definition states that two graphs are isomorphic if there
exists a vertex permutation such that when applied to one graph, we recover the 
vertex and edge sets of the other graph. 
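
As a small illustration of this definition, the sketch below checks whether a candidate vertex mapping g is an isomorphism between two graphs given as vertex and edge sets. The function name `is_isomorphism` and the two toy graphs are illustrative assumptions and do not come from the paper.

```python
def is_isomorphism(v1, e1, v2, e2, g):
    """Check that g (dict: vertex of G1 -> vertex of G2) is a bijection
    V1 -> V2 that maps the edge set E1 exactly onto the edge set E2."""
    # g must be defined on all of V1, hit all of V2, and be one-to-one.
    is_bijection = (
        set(g.keys()) == set(v1)
        and set(g.values()) == set(v2)
        and len(set(g.values())) == len(v1)
    )
    if not is_bijection:
        return False
    # (u, v) in E1  <=>  (g(u), g(v)) in E2
    return {(g[u], g[v]) for (u, v) in e1} == set(e2)


# Toy example (hypothetical data): two directed paths on three vertices.
V1, E1 = {0, 1, 2}, {(0, 1), (1, 2)}
V2, E2 = {"a", "b", "c"}, {("c", "a"), ("a", "b")}

good = is_isomorphism(V1, E1, V2, E2, {0: "c", 1: "a", 2: "b"})
bad = is_isomorphism(V1, E1, V2, E2, {0: "a", 1: "b", 2: "c"})
```

The check only validates a given mapping; searching for such a mapping is the (much harder) graph isomorphism problem.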


3.1 Graph Neural Networks 


Consider a graph G with adjacency matrix A and feature matrix X. GNNs 
use the graph structure (A) and the node features (X) to learn a node-level or 
a graph-level representation—or embedding—of G. GNNs iteratively update a 
node representation by aggregating its neighbors’ representations. At iteration l,
a node representation captures its l-hop neighborhood’s structural information.
Formally, the lth layer of a general GNN can be defined as follows:


$$a_i^{l+1} = \mathrm{AGGREGATE}^l(\{z_j^l : j \in N(i)\}) , \tag{1}$$
$$z_i^{l+1} = \mathrm{COMBINE}^l(z_i^l, a_i^{l+1}) , \tag{2}$$


where $z_i^{l+1}$ is the feature vector of node $v_i$ at layer l, $N(i)$ is the set of
neighbors of $v_i$, and $z_i^0 = x_i$. While COMBINE usually consists in concatenating
node representations from different layers, different (and often complex)
architectures for AGGREGATE have been proposed. In [13], the presented GCN merges
the AGGREGATE and COMBINE functions as follows:


$$z_i^{l+1} = \mathrm{ReLU}\left(\mathrm{mean}(\{z_j^l : j \in N(i) \cup \{i\}\})\, W^l\right) , \tag{3}$$


where ReLU is a rectified linear unit and $W^l$ is a trainable weight matrix. GNNs
for graph classification have an additional module that aggregates the node-level 
representations to produce a graph-level one as follows: 


$$z_G = \mathrm{READOUT}(\{z_i^L : v_i \in V\}) , \tag{4}$$


for a GNN with L layers. In [25], the authors discuss the impact that the choice
of $\mathrm{AGGREGATE}^l$, $\mathrm{COMBINE}^l$, and READOUT has on the so-called
expressiveness of the GNN, that is, its ability to map different graphs to different
embeddings. They present theoretical conditions under which a GNN is maximally
expressive.

We now present a simple yet practical GNN architecture on which we illus- 
trate our experimental framework. 


4 Simple Permutation-Invariant Graph Convolutional 
Network (SPI-GCN) 


Our Simple Permutation-Invariant Graph Convolutional Network (SPI-GCN) 
consists of the following sequential modules: (1) a graph convolution module
that encodes local graph structure and node features in a substructure feature
matrix whose rows represent the nodes of the graph, (2) a sum-pooling layer as 
a READOUT function to produce a single-vector representation of the input 
graph, and (3) a prediction module consisting of dense layers that reads the 
vector representation of the graph and outputs predictions. 

Let G be a graph represented by the adjacency matrix $A \in \mathbb{R}^{n \times n}$ and the
feature matrix $X \in \mathbb{R}^{n \times d}$, where n and d represent the number of nodes and
the dimension of the feature space respectively. Without loss of generality, we 
consider graphs without self-loops. 


4.1 Graph Convolution Module 


Given a graph G with its adjacency and feature matrices, A and X, we define 
the first convolution layer as follows: 


$$Z = f(\hat{D}^{-1} \hat{A} X W) , \tag{5}$$


where $\hat{A} = A + I_n$ is the adjacency matrix of G with added self-loops, $\hat{D}$ is the
diagonal node degree matrix of $\hat{A}$,² $W \in \mathbb{R}^{d \times d'}$ is a trainable weight matrix, f is
a nonlinear activation function, and $Z \in \mathbb{R}^{n \times d'}$ is the convolved graph. To stack
multiple convolution layers, we generalize the propagation rule in (5) as follows:


$$Z^{l+1} = f^l(\hat{D}^{-1} \hat{A} Z^l W^l) , \tag{6}$$


where $Z^0 = X$, $Z^l$ is the output of the lth convolution layer, $W^l$ is a trainable
weight matrix, and $f^l$ is the nonlinear activation function applied at layer l.
Similarly to the GCN presented in [13] from which we draw inspiration, our 
graph convolution module merges the AGGREGATE and COMBINE functions 
(see (1) and (2)), and we can rewrite (6) as: 


$$z_i^{l+1} = f^l\left(\mathrm{mean}(\{z_j^l : j \in N(i) \cup \{i\}\})\, W^l\right) , \tag{7}$$

where $z_i^{l+1}$ is the ith row of $Z^{l+1}$.

We return the result of the last convolution layer, that is, for a network with
L convolution layers, the result of the convolution is the last substructure feature
matrix $Z^L$. Note that (6) is able to process graphs with varying node numbers.
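
To make the propagation rule concrete, the following dependency-free sketch implements one convolution layer in the matrix form of (5) and (6): self-loops are added, rows are normalized by the resulting degree, and the result is multiplied by the features and the weight matrix. The helper names (`matmul`, `graph_conv`), the toy graph, and the choice of an identity weight matrix are assumptions made for illustration; they are not the authors' implementation.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adj, feats, weight, activation=math.tanh):
    """One layer Z = f(D_hat^-1 A_hat X W), where A_hat = A + I."""
    n = len(adj)
    # Add self-loops: A_hat = A + I_n.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Row-normalize: each row sum is the (out)degree of A_hat, never zero
    # because of the self-loop.
    normalized = [[v / sum(row) for v in row] for row in a_hat]
    out = matmul(matmul(normalized, feats), weight)
    return [[activation(v) for v in row] for row in out]


# Toy example (hypothetical data): path graph 0 - 1 - 2 with 2 features per node.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With an identity weight matrix and identity activation, each output row is
# the mean of the features over the node and its neighbors, as in (7).
z = graph_conv(adj, feats, identity, activation=lambda v: v)
```

Stacking layers amounts to feeding the output of one call as the `feats` argument of the next, with a different weight matrix per layer.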


4.2 Sum-Pooling Layer 


The sum-pooling layer produces a graph-level representation $z_G$ by summing the
rows of $Z^L$, previously returned by the convolution module. Formally:


$$z_G = \sum_{i=1}^{n} z_i^L . \tag{8}$$


² If G is a directed graph, $\hat{D}$ corresponds to the outdegree diagonal matrix of $\hat{A}$.

The resulting vector $z_G \in \mathbb{R}^{d_L}$ contains the final vector representation (or
embedding) of the input graph G in a $d_L$-dimensional space. This vector representation
is then used for prediction (graph classification in our case).

Using a sum pooling operator is a simple idea that has been used in GNNs 
such as [1,21]. Additionally, it results in the invariance of our architecture to 
node permutation, as stated in Theorem 1. 


Theorem 1. Let G and G' be two arbitrary isomorphic graphs. The sum-pooling
layer of SPI-GCN produces the same vector representation for G and G'.


This invariance property is crucial for GNNs as it ensures that two isomorphic— 
and hence equivalent—graphs will result in the same output. The proof of The- 
orem 1 is straightforward and omitted for space limitations. 
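
The invariance stated in Theorem 1 can be checked numerically on a small example. The sketch below applies one mean-aggregation layer in the node-wise form of (7), sum-pools the rows as in (8), and compares the result for a graph and a relabeled copy of it. All names (`conv`, `sum_pool`, `permute`) and the toy data are assumptions for illustration only; this is a sanity check, not a proof.

```python
import math


def conv(adj, feats, weight):
    """One layer in node-wise form: ReLU(mean over N(i) and i itself, times W)."""
    n, d = len(adj), len(feats[0])
    rows = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j] or j == i]  # N(i) plus i
        mean = [sum(feats[j][k] for j in nbrs) / len(nbrs) for k in range(d)]
        rows.append([max(0.0, sum(m * w for m, w in zip(mean, col)))
                     for col in zip(*weight)])
    return rows


def sum_pool(z):
    """READOUT: sum the node representations (rows of Z) column by column."""
    return [sum(col) for col in zip(*z)]


def permute(adj, feats, perm):
    """Relabel nodes: node i of the new graph is node perm[i] of the old one."""
    n = len(adj)
    new_adj = [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    new_feats = [feats[perm[i]] for i in range(n)]
    return new_adj, new_feats


# Toy example (hypothetical data): 4 nodes, 2 input features, 3 output features.
adj = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 3.0]]
weight = [[1.0, -1.0, 0.5], [0.5, 2.0, -1.0]]

z_g = sum_pool(conv(adj, feats, weight))
p_adj, p_feats = permute(adj, feats, [2, 0, 3, 1])
z_g_perm = sum_pool(conv(p_adj, p_feats, weight))

# Equal up to floating-point rounding, since only the summation order changes.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_g, z_g_perm))
```

The node-level rows of the convolved graph are permuted along with the nodes; only the pooled vector is unchanged.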


4.3 Prediction Module 


The prediction module of SPI-GCN is a simple MLP that takes as input the
graph-level representation $z_G$ returned by the sum-pooling layer and returns
either: (i) a probability p in case of binary classification or (ii) a vector p of
probabilities such that $\sum_i p_i = 1$ in case of multi-class classification.

Note that SPI-GCN can be trained in an end-to-end fashion through back- 
propagation. Additionally, since only one graph is treated in a forward pass, the 
training complexity of SPI-GCN is linear in the number of graphs. 

In the next section, we describe a practical methodology for studying the 
expressiveness of SPI-GCN and its connection to the generalization performance 
of the algorithm. 


5 Investigating Expressiveness of SPI-GCN 


We start here by introducing a practical definition of expressiveness. We then 
show how the defined measure can be used to train SPI-GCN and help under- 
stand the impact expressiveness has on its generalization performance. 


5.1 Practical Measure of Expressiveness 


The expressiveness of a GNN, as defined in [25], is its ability to map different 
graph structures to different embeddings and, therefore, reflects the injectivity 
of its graph embedding function. Since studying injectivity can be tedious, we 
characterize expressiveness—and hence injectivity—as a function of the pairwise 
distance between graph embeddings. 
Let $\{z_{G_i}\}_{i=1}^{m}$ be the set of graph embeddings computed by a GNN $\mathcal{A}$ for
a given input graph data set $\{G_i\}_{i=1}^{m}$. We define $\mathcal{A}$’s expressiveness, $\mathcal{E}(\mathcal{A})$, as
follows:

$$\mathcal{E}(\mathcal{A}) = \mathrm{mean}(\{\|z_{G_i} - z_{G_j}\|_2 : i, j = 1, \ldots, m,\ i \neq j\}) , \tag{9}$$

that is, $\mathcal{E}(\mathcal{A})$ is the average pairwise Euclidean distance between graph embeddings
produced by $\mathcal{A}$. While not strictly equivalent to injectivity, $\mathcal{E}$ is a reasonable
indicator thereof, as the average pairwise distance reflects the diversity within
graph representations which, in turn, is expected to be higher for more diverse
input graph data sets. For permutation-invariant GNNs like SPI-GCN,³ $\mathcal{E}$ is zero
when all graphs $\{G_i\}_{i=1}^{m}$ are isomorphic.
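
The measure in (9) is straightforward to compute once the embeddings are available. The following sketch is one possible implementation under the assumption that embeddings are given as lists of floats; the function name `expressiveness` is ours. Because the Euclidean distance is symmetric, averaging over unordered pairs gives the same value as averaging over all ordered pairs with i different from j.

```python
import math
from itertools import combinations


def expressiveness(embeddings):
    """Average pairwise Euclidean distance between graph embeddings, as in (9).

    Requires at least two embeddings.
    """
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    return sum(dists) / len(dists)


# Toy example (hypothetical embeddings): two coincide, one lies at distance 5.
value = expressiveness([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])  # (5 + 0 + 5) / 3

# Identical embeddings, e.g. for isomorphic graphs, give zero expressiveness.
zero = expressiveness([[1.0, 2.0], [1.0, 2.0]])
```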


5.2 Penalized Cross Entropy Loss 


We train SPI-GCN using a penalized cross entropy loss, $\mathcal{L}_p$, that consists of a
classical cross entropy augmented with a penalty term defined as a function of
the expressiveness of SPI-GCN. Formally:


$$\mathcal{L}_p = \text{cross-entropy}(\{y_i\}_{i=1}^{m}, \{\hat{y}_i\}_{i=1}^{m}) - \alpha \cdot \mathcal{E}(\text{SPI-GCN}) , \tag{10}$$


where $\{y_i\}_{i=1}^{m}$ (resp. $\{\hat{y}_i\}_{i=1}^{m}$) is the set of real (resp. predicted) graph labels, $\alpha$
is a non-negative penalty factor, and $\mathcal{E}$ is defined in (9) with $\{z_{G_i}\}_{i=1}^{m}$ being the
graph embeddings computed by SPI-GCN.

By adding the penalty term $-\alpha \cdot \mathcal{E}(\text{SPI-GCN})$ in $\mathcal{L}_p$, the expressiveness is
maximized while the cross entropy is minimized during the training process.
The penalty factor $\alpha$ controls the importance attributed to $\mathcal{E}$(SPI-GCN) when
$\mathcal{L}_p$ is minimized. Consequently, higher values of $\alpha$ yield more expressive
variants of SPI-GCN whereas for $\alpha = 0$, only the cross entropy is minimized.
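
As a worked example of (10), the sketch below evaluates the penalized loss for given labels, predicted class probabilities, and graph embeddings. It assumes a cross entropy averaged over graphs (the text does not state whether a mean or a sum is used), and the function names are illustrative. It only evaluates the loss value; in actual training the gradient would also flow through the embeddings.

```python
import math
from itertools import combinations


def cross_entropy(labels, probs):
    """Mean negative log-likelihood; labels are class indices (assumption)."""
    return -sum(math.log(p[y]) for y, p in zip(labels, probs)) / len(labels)


def penalized_loss(labels, probs, embeddings, alpha):
    """Cross entropy minus alpha times the expressiveness of (9)."""
    pairs = list(combinations(embeddings, 2))
    expr = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return cross_entropy(labels, probs) - alpha * expr


# Toy example (hypothetical values): two graphs, uniform predictions,
# embeddings at Euclidean distance 5 from each other.
labels = [0, 1]
probs = [[0.5, 0.5], [0.5, 0.5]]
embeddings = [[0.0, 0.0], [3.0, 4.0]]

plain = penalized_loss(labels, probs, embeddings, alpha=0.0)      # ln 2
penalized = penalized_loss(labels, probs, embeddings, alpha=0.1)  # ln 2 - 0.5
```

With a positive penalty factor, spreading the embeddings further apart lowers the loss even if the predictions do not change.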

In the next section, we assess the performance of SPI-GCN for different values
of $\alpha$. We also compare SPI-GCN with other more complex GNN architectures,
including the state-of-the-art method. 


6 Experiments 


We carry out a first set of experiments where we compare our approach,
SPI-GCN, with two recent GCNs. In a second set of experiments, we train different
instances of SPI-GCN with increasing values of the penalty factor $\alpha$ (see (10))
in an attempt to understand how the expressiveness of SPI-GCN affects its test
accuracy, and whether it is the determining factor of its generalization
performance, as implicitly suggested in [25]. Our code and data are available at
https://github.com/asmaatamna/SPI-GCN.


6.1 Data Sets 


We use nine public benchmark data sets including five bioinformatics data sets 
(MUTAG [6], PTC [22], ENZYMES [3], NCI1 [23], PROTEINS [8]), two social 
network data sets (IMDB-BINARY, IMDB-MULTI [26]), one image data set 
where images are represented as region adjacency graphs (COIL-RAG [18]), and 
one synthetic data set (SYNTHIE [16]). We also evaluate SPI-GCN on an original 
real-world data set collected at the ICMPE,⁴ HYDRIDES, that contains metal
hydrides in graph format, labelled as stable or unstable according to specific 
energetic properties that determine their ability to store hydrogen efficiently. 


³ As mentioned previously, permutation-invariance is a minimal requirement
for any practical GNN.
⁴ East Paris Institute of Chemistry and Materials Science, France.


6.2 Architecture of SPI-GCN


The instance of SPI-GCN that we use for experiments has two graph convolution 
layers of 128 and 32 hidden units respectively, followed by a hyperbolic tangent 
function and a softmax function (per node) respectively. The sum-pooling layer 
is a classical sum applied row-wise; it is followed by a prediction module
consisting of an MLP with one hidden layer of 256 hidden units followed by a batch
normalization layer and a ReLU. We choose this architecture by trial and error
and keep it unchanged throughout the experiments. 


6.3 Comparison with Other Methods 


In these experiments, we consider the simplest variant of SPI-GCN where the 
penalty term in (10) is discarded by setting $\alpha = 0$. That is, the algorithm is
trained using only the cross entropy loss. 


Baselines. We compare SPI-GCN with the well-known GCN, PATCHY-SAN 
(PSCN) [17], the Deep Graph Convolutional Neural Network (DGCNN) [27] that 
uses a similar convolution module to ours, and the recent state-of-the-art Graph 
Isomorphism Network (GIN) [25]. 


Experimental Procedure. We train SPI-GCN using the full-batch ADAM optimizer
[12], with cross entropy as the loss function to minimize ($\alpha = 0$ in (10)).
After some experimentation, we set the training hyperparameters as follows: the
algorithm is trained for 200 epochs on all data sets and the learning rate is set
to $10^{-3}$. To estimate the accuracy, we perform 10-fold cross validation using 9
folds for training and one fold for testing each time. We report the average (test) 
accuracy and the corresponding standard deviation in Table 1. Note that we only 
use node attributes in our experiments. In particular, SPI-GCN does not exploit 
node or edge labels of the data sets. When node attributes are not available, we 
use the identity matrix as the feature matrix for each graph. 
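
The 10-fold protocol can be sketched as follows. This is a generic, dependency-free fold generator; whether the original experiments shuffle or stratify the graphs is not stated, so the shuffling with a fixed seed and the function name `k_fold_indices` are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is used once for testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # assumption: shuffle before splitting
    folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Toy example: 25 graphs split into 10 folds of 2 or 3 graphs each.
splits = list(k_fold_indices(25, k=10))
```

The reported accuracy would then be the mean and standard deviation of the test accuracy over the ten (train, test) pairs.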

We follow the same procedure for DGCNN. We use the authors’ implementation⁵
and perform 10-fold cross validation with the recommended values for
training epochs, learning rate, and SortPooling parameter k, for each data set.

For PSCN, we report the results from the original paper [17] (for receptive
field size k = 10) as we could not find a public implementation of the
algorithm by its authors. The experiments were conducted using a procedure similar to ours.

For GIN, we also report the published results [25] (GIN-0 in the paper), as 
it was not straightforward to use the authors’ implementation. 


Results. Table 1 shows the results for our algorithm (SPI-GCN), DGCNN [27],
PSCN [17], and the state-of-the-art GIN [25]. We observe that SPI-GCN is highly 
competitive with other algorithms despite using the same architecture for all 
data sets. The only noticeable exceptions are on the NCI1 and IMDB-BINARY 
data sets, where the best approach (GIN) is up to 1.28 times better. On the 
other hand, SPI-GCN appears to be highly competitive on classification tasks 


⁵ https://github.com/muhanzhang/pytorch_DGCNN.

with more than 3 classes (ENZYMES, COIL-RAG, SYNTHIE). The difference in
accuracy is particularly significant on COIL-RAG (100 classes), where SPI-GCN 
is around 34 times better than DGCNN, suggesting that the features extracted 
by SPI-GCN are more suitable to characterize the graphs at hand. SPI-GCN 
also achieves a very reasonable accuracy on the HYDRIDES data set and is 1.06 
times better than DGCNN on ENZYMES. 

The results in Table 1 show that despite its simplicity, SPI-GCN is com- 
petitive with other practical graph algorithms and, hence, it is a reasonable 
architecture to consider for our next set of experiments involving expressiveness. 


Table 1. Accuracy results for SPI-GCN and three other deep learning methods
(DGCNN, PSCN, GIN).


Data set    | SPI-GCN      | DGCNN        | PSCN         | GIN
MUTAG       | 84.40 ± 8.14 | 86.11 ± 7.14 | 88.95 ± 4.37 | 89.4 ± 5.6
PTC         | 56.41 ± 5.71 | 55.00 ± 5.10 | 62.29 ± 5.68 | 64.6 ± 7.0
NCI1        | 64.11 ± 2.37 | 72.73 ± 1.56 | 76.34 ± 1.68 | 82.7 ± 1.7
PROTEINS    | 72.06 ± 3.18 | 72.79 ± 3.58 | 75.00 ± 2.51 | 76.2 ± 2.8
ENZYMES     | 50.17 ± 5.60 | 47.00 ± 8.36 | -            | -
IMDB-BINARY | 60.40 ± 4.15 | 68.60 ± 5.66 | 71.00 ± 2.29 | 75.1 ± 5.1
IMDB-MULTI  | 44.13 ± 4.61 | 45.20 ± 3.75 | 45.23 ± 2.84 | 52.3 ± 2.8
COIL-RAG    | 74.38 ± 2.42 | 2.21 ± 0.33  | -            | -
SYNTHIE     | 71.00 ± 6.44 | 54.25 ± 4.34 | -            | -
HYDRIDES    | 82.75 ± 2.67 | -            | -            | -


6.4 Expressiveness Experiments 
Through these experiments, we try to answer the following questions: 


— Do more expressive GNNs perform better on graph classification tasks? That 
is, is the injectivity of a GNN’s graph function the determining factor of its 
performance? 

— Can the performance be explained by another factor? If yes, what is it? 


To this end, we train increasingly injective instances of SPI-GCN on the
penalized cross entropy loss $\mathcal{L}_p$ (10) by setting the penalty factor $\alpha$ to increasingly
large values. Then, for each trained instance, we investigate (i) its test
accuracy, (ii) its expressiveness $\mathcal{E}$(SPI-GCN) (9), and (iii) the average normalized
Inter-class Graph Embedding Distance (IGED), defined as the average pairwise
Euclidean distance between mean graph embeddings taken class-wise divided by
$\mathcal{E}$(SPI-GCN). Formally:


$$\mathrm{IGED} = \frac{\mathrm{mean}(\{\|z^{c_k} - z^{c_l}\|_2 : k, l = 1, \ldots, C,\ k \neq l\})}{\mathcal{E}(\text{SPI-GCN})} , \tag{11}$$
Table 2. Expressiveness experiments results. SPI-GCN is trained on the penalized
cross entropy loss, $\mathcal{L}_p$, with increasing values of the penalty factor $\alpha$. For each data set,
and for each value of $\alpha$, we report the test accuracy (a), the expressiveness $\mathcal{E}$(SPI-GCN)
(b), and the IGED (c). The maximal values for each quantity are marked with an asterisk (*).


Data set   |     | α = 0          | α = 10^-3      | α = 10^-1      | α = 1          | α = 10
MUTAG      | (a) | 84.40 ± 8.14   | 84.40 ± 8.14   | 86.07 ± 9.03*  | 82.56 ± 7.33   | 81.45 ± 6.68
           | (b) | 0.09 ± 0.01    | 0.09 ± 0.01    | 0.12 ± 0.01    | 5.96 ± 1.08    | 6.32 ± 0.76*
           | (c) | 0.68 ± 0.16    | 0.68 ± 0.16    | 0.82 ± 0.18    | 1.21 ± 0.23*   | 1.20 ± 0.22
PTC        | (a) | 56.41 ± 5.71   | 54.97 ± 6.05   | 54.64 ± 6.33   | 57.88 ± 8.65   | 58.70 ± 7.40*
           | (b) | 0.09 ± 0.01    | 0.09 ± 0.01    | 0.11 ± 0.01    | 8.41 ± 3.13    | 9.03 ± 2.94*
           | (c) | 0.26 ± 0.05    | 0.26 ± 0.05    | 0.26 ± 0.06    | 0.41 ± 0.22    | 0.42 ± 0.22*
NCI1       | (a) | 64.11 ± 2.37   | 64.21 ± 2.36*  | 64.01 ± 2.87   | 63.48 ± 1.36   | 63.19 ± 1.72
           | (b) | 0.09 ± 0.004   | 0.09 ± 0.005   | 1.07 ± 0.19    | 16.83 ± 0.49   | 16.91 ± 0.52*
           | (c) | 0.18 ± 0.02    | 0.19 ± 0.03    | 0.59 ± 0.05    | 0.62 ± 0.05*   | 0.62 ± 0.05*
PROTEINS   | (a) | 72.06 ± 3.18*  | 71.78 ± 3.55   | 71.51 ± 3.26   | 70.97 ± 3.49   | 71.42 ± 3.23
           | (b) | 5.89 ± 1.34    | 13.07 ± 3.21   | 35.88 ± 4.89*  | 35.88 ± 4.89*  | 35.88 ± 4.89*
           | (c) | 0.74 ± 0.09*   | 0.74 ± 0.09*   | 0.74 ± 0.09*   | 0.74 ± 0.09*   | 0.74 ± 0.09*
ENZYMES    | (a) | 50.17 ± 5.60*  | 50.17 ± 5.60*  | 29.33 ± 5.93   | 29.33 ± 5.54   | 29.33 ± 5.88
           | (b) | 0.79 ± 0.21    | 1.85 ± 0.64    | 23.22 ± 2.99   | 23.33 ± 3.02   | 23.35 ± 3.01*
           | (c) | 0.44 ± 0.06*   | 0.42 ± 0.10    | 0.42 ± 0.10    | 0.42 ± 0.10    | 0.42 ± 0.10
IMDB-BIN.  | (a) | 60.40 ± 4.15   | 61.70 ± 4.96*  | 61.10 ± 3.75   | 54.40 ± 3.10   | 54.20 ± 5.15
           | (b) | 0.12 ± 0.01    | 0.12 ± 0.01    | 0.16 ± 0.01    | 12.43 ± 2.37*  | 11.70 ± 2.89
           | (c) | 0.15 ± 0.03*   | 0.15 ± 0.03*   | 0.15 ± 0.03*   | 0.12 ± 0.08    | 0.12 ± 0.08
IMDB-MUL.  | (a) | 44.13 ± 4.61   | 44.60 ± 5.41   | 44.80 ± 4.51*  | 39.73 ± 4.34   | 38.87 ± 4.42
           | (b) | 0.08 ± 0.01    | 0.08 ± 0.01    | 0.64 ± 0.14    | 10.38 ± 1.05*  | 9.91 ± 1.15
           | (c) | 0.16 ± 0.02*   | 0.16 ± 0.02*   | 0.16 ± 0.09*   | 0.15 ± 0.09    | 0.15 ± 0.09
COIL-RAG   | (a) | 74.38 ± 2.42*  | 74.38 ± 2.45*  | 72.49 ± 3.21   | 52.08 ± 4.89   | 28.72 ± 3.62
           | (b) | 0.08 ± 0.002   | 0.081 ± 0.002  | 0.13 ± 0.01    | 2.00 ± 0.18    | 2.33 ± 0.14*
           | (c) | 0.95 ± 0.01    | 0.95 ± 0.01    | 0.96 ± 0.01    | 0.98 ± 0.02*   | 0.98 ± 0.02*
SYNTHIE    | (a) | 71.00 ± 6.44   | 71.00 ± 6.04   | 74.00 ± 6.44*  | 73.00 ± 7.57   | 73.75 ± 7.52
           | (b) | 1.60 ± 0.20    | 1.86 ± 0.24    | 29.97 ± 2.16*  | 29.50 ± 2.18   | 29.37 ± 2.18
           | (c) | 0.73 ± 0.07*   | 0.72 ± 0.08    | 0.61 ± 0.11    | 0.59 ± 0.12    | 0.58 ± 0.12
HYDRIDES   | (a) | 82.75 ± 2.67   | 82.65 ± 2.44   | 83.92 ± 4.30*  | 77.45 ± 3.25   | 76.37 ± 2.57
           | (b) | 0.13 ± 0.01    | 0.13 ± 0.01    | 1.68 ± 0.87    | 4.75 ± 0.41    | 5.03 ± 0.75*
           | (c) | 0.50 ± 0.11    | 0.50 ± 0.11    | 0.8 ± 0.19     | 0.85 ± 0.21*   | 0.72 ± 0.22


where $z^{c_k}$ is the mean graph embedding for class k. The IGED can be interpreted
as an estimate of how well the graph embeddings computed by SPI-GCN are 
separated with respect to their respective class. 
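
The IGED can be computed directly from the graph embeddings and their class labels. The sketch below follows the verbal definition above (average pairwise distance between class-wise mean embeddings, divided by the average pairwise distance between all embeddings); the function names and the toy data are assumptions, and the code presumes at least two classes and embeddings that are not all identical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def iged(embeddings, labels):
    """Inter-class Graph Embedding Distance, normalized by expressiveness."""
    by_class = {}
    for z, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(z)
    # Mean embedding of each class, computed coordinate by coordinate.
    class_means = [[sum(col) / len(zs) for col in zip(*zs)]
                   for zs in by_class.values()]
    return mean_pairwise_distance(class_means) / mean_pairwise_distance(embeddings)


# Toy example (hypothetical embeddings): two classes whose means are
# (0, 1) and (4, 1), i.e. at distance 4 from each other.
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
score = iged(embeddings, labels)
```

Values above 1 indicate that class means are, on average, further apart than two arbitrary embeddings, which is the situation in this toy example.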


Experimental Procedure. We train SPI-GCN on the penalized cross entropy
loss $\mathcal{L}_p$ (10) where we sequentially choose $\alpha$ from $\{0, 10^{-3}, 10^{-1}, 1, 10\}$. We do
so using the full-batch ADAM optimizer that we run for 200 epochs with a learning
rate of $10^{-3}$, on all the graph data sets introduced previously. For each data set
and for each value of $\alpha$, we perform 10-fold cross validation using 9 folds for
training and one fold for testing. We report in Table 2 the average and standard
deviation of: (a) the test accuracy, (b) the expressiveness $\mathcal{E}$(SPI-GCN), and (c)
the IGED (11), for each value of $\alpha$ and for each data set.


Results. We observe from Table 2 that using a penalty term in $\mathcal{L}_p$ to maximize
the expressiveness (or injectivity) of SPI-GCN helps to improve the test
accuracy on some data sets, notably on MUTAG, PTC, and SYNTHIE. However,
larger values of $\mathcal{E}$(SPI-GCN) do not correspond to a higher test accuracy
except for two cases (PTC, SYNTHIE). Overall, $\mathcal{E}$(SPI-GCN) increases
when $\alpha$ increases, as expected, since the expressiveness is maximized during
training when $\alpha > 0$. The IGED, on the other hand, is correlated to the best
performance in four out of ten cases (ENZYMES, IMDB-BINARY, and IMDB-MULTI),
where the test accuracy is maximal when the IGED is maximal. On
HYDRIDES, the difference in IGED for $\alpha = 10^{-1}$ (highest accuracy) and $\alpha = 1$
(highest IGED value) is negligible.

Our empirical results indicate that while optimizing the expressiveness of 
SPI-GCN may result in a higher test accuracy in some cases, more expressive 
GNNs do not systematically perform better in practice. The IGED, however, 
which reflects a GNN’s ability to compute graph representations that are cor- 
rectly clustered according to their effective class, better explains the generaliza- 
tion performance of the GNN. 


7 Conclusion 


In this paper, we challenged the common belief that more expressive GNNs 
achieve a better performance. We introduced a principled experimental pro- 
cedure to analyze the link between the expressiveness of a GNN and its test 
accuracy in a graph classification setting. To the best of our knowledge, our 
work is the first that explicitly studies the generalization performance of GNNs 
by trying to uncover the factors that control it, and paves the way for more 
theoretical analyses. Interesting directions for future work include the design of 
better expressiveness estimators, as well as different (possibly more complex) 
penalized loss functions.


Abstract. Learning from potentially infinite and high-dimensional
data streams poses significant challenges in the classification task. For
instance, k-Nearest Neighbors (kNN) is one of the most often used algorithms
in the data stream mining area, yet it has proved to be very
resource-intensive when dealing with high-dimensional spaces. Uniform Manifold
Approximation and Projection (UMAP) is a novel manifold technique 
and one of the most promising dimension reduction and visualization 
techniques in the non-streaming setting because of its high performance 
in comparison with competitors. However, there is no version of UMAP 
that copes with the challenging context of streams. To overcome these 
restrictions, we propose a batch-incremental approach that pre-processes 
data streams using UMAP, by producing successive embeddings on a 
stream of disjoint batches in order to support an incremental kNN classi- 
fication. Experiments conducted on publicly available synthetic and real- 
world datasets demonstrate the substantial gains that can be achieved 
with our proposal compared to state-of-the-art techniques. 


Keywords: Data stream · k-Nearest Neighbors · Dimension
reduction · UMAP


1 Introduction 


With the evolution of technology, several kinds of devices and applications are 
continuously generating large amounts of data in a fast-paced way as streams. 
Hence, the data stream mining area has become indispensable and ubiquitous 
in many real-world applications that require real-time (or near real-time)
processing, e.g., social networks, weather forecast, spam filters, and more. Unlike
traditional datasets, the dynamic environment and the tremendous volume of 
data streams make them impossible to store or to scan multiple times [12]. 

Classification is an active area of research in the data stream mining field,
where several researchers are working to develop new algorithms or improve
existing ones [14]. However, the dynamic nature of data streams has outpaced
the capability of traditional classification algorithms to process data streams.
In this context, a multitude of supervised algorithms for static datasets that 
have been widely studied in the offline processing, and proved to be of lim- 
ited effectiveness on large data, have been extended to work within a stream- 
ing framework [3,5,11,18]. Data stream mining approaches can be divided into 
two main types [23]: (i) instance-incremental approaches which update the 
model with each instance as soon as it arrives, such as Self-Adjusting Memory 
kNN (samkNN) [18], and Hoeffding Adaptive Tree (HAT) [4]; and (ii) batch- 
incremental approaches which make no change/increment to their model until 
a batch is completed, e.g., support vector machines [10], and batch-incremental 
ensemble of decision trees [15]. Nevertheless, the high dimensionality of data
complicates the classification task for some algorithms and increases their
computational resource requirements, most notably for the k-Nearest Neighbors (kNN)
algorithm because it needs the entire dataset to predict the labels for test
instances [23]. To cope with this issue, a promising approach is feature
transformation, which transforms the input features into a new set of features,
containing the most relevant components, in some lower-dimensional space.

In an attempt to improve the performance of kNN, we incorporate a
batch-incremental feature transformation strategy to tackle potentially
high-dimensional and possibly infinite batches of evolving data streams while
ensuring effectiveness and quality of learning (e.g. accuracy). This is achieved via
a new manifold technique that has attracted a lot of attention recently:
Uniform Manifold Approximation and Projection (UMAP) [21], built upon rigorous
mathematical foundations, namely Riemannian geometry. To the best of our
knowledge, no incremental version of UMAP exists, which makes it inapplicable
to large datasets. The main contributions are summarized as follows:


— Batch-Incremental UMAP: a new batch-incremental manifold learning
technique, based on extending the UMAP algorithm to data streams.

— UMAP-kNearest Neighbors (UMAP-kNN): a new batch-incremental kNN
algorithm for data streams classification using UMAP.

— Empirical experiments: we provide an experimental study, on various
datasets, that discusses the implication of parameters on the algorithms’
performance.


The paper is organized as follows. Section 2 reviews the prominent related 
work. Section 3 provides the background of UMAP, followed by the description 
of our approach. In Sect. 4 we present and discuss the results of experiments on 
diverse datasets. Finally, we draw our conclusions and present future directions.


2 Related Work


Dimensionality reduction (DR) is a powerful tool in data science to look for
hidden structure in data and reduce the resource usage of learning algorithms.
The problem of dimensionality has been widely studied [25] and DR is used throughout
different domains, such as image processing and face recognition. Dimensionality
reduction techniques facilitate the classification task, by removing redundancies
and extracting the most relevant features in the data, and permit a better data
visualization. A common taxonomy divides these approaches into two major
groups: matrix factorization and graph-based approaches.

Matrix factorization algorithms, such as Principal Components Analysis (PCA) [16],
require matrix computation tools. PCA is a well-known linear technique
that uses singular value decomposition and aims to find a lower-dimensional basis
by converting the data into features called principal components, obtained by computing
the eigenvalues and eigenvectors of a covariance matrix. This straightforward
technique is computationally cheap but ineffective with data streams since it
relies on the whole dataset. Therefore, some incremental versions of PCA have
been developed to handle streams of data [13, 24, 26].

Graph/Neighborhood-based techniques are leveraged in the context of dimen- 
sion reduction and visualization by using the insight that similar instances in a 
large space should be represented by close instances in a low-dimensional space, 
whereas dissimilar instances should be well separated. t-distributed Stochastic 
Neighbor Embedding (tSNE) [20] is one of the most prominent DR techniques in 
the literature. It has been proposed to visualize high-dimensional data embedded
in a lower space (typically 2 or 3 dimensions). In addition to being
computationally expensive, tSNE does not preserve distances between all
instances: it conserves more of the local structure than the global structure,
which can affect any density- or distance-based algorithm.


3 Batch-Incremental Classification 


In the following, we assume a data stream S is a sequence of instances
$X_1, \ldots, X_N$, where N denotes the number of available observations so far. Each
instance $X_i$ is composed of a vector of d attributes $X_i = (x_i^1, \ldots, x_i^d)$. The
dimensionality reduction of S comprises the process of finding a low-dimensional
representation $S' = Y_1, \ldots, Y_N$, where $Y_i = (y_i^1, \ldots, y_i^p)$ and $p < d$.


3.1 Prior Work 


Unlike tSNE [20], UMAP has no restriction on the projected space size making it 
useful not only for visualization but also as a general dimension reduction tech- 
nique for machine learning algorithms. It starts by constructing open balls over 
all instances and building simplicial complexes. Dimension reduction is obtained 
by finding a representation, in a lower space, that closely resembles the topo- 
logical structure in the original space. Given the new dimension, an equivalent 
Fig. 1. Projection of CNAE dataset in 2-dimensional space. Offline: (a) UMAP, (b)
tSNE, and (c) PCA. Batch-incremental: (d) UMAP, (e) tSNE, and (f) PCA. (Color
figure online)


fuzzy topological representation is then constructed [21]. Then, UMAP optimizes 
it by minimizing the cross-entropy between these two fuzzy topological represen- 
tations. UMAP offers better visualization quality than tSNE by preserving more 
of the global structure in a shorter running time. To the best of our knowledge, 
none of these techniques has a streaming version. Ultimately, both techniques 
are essentially transductive¹ and do not learn a mapping function from the input
space. Hence, they need to process all the data for each new unseen instance, 
which prevents them from being usable in data streams classification models. 
Figure 1 shows the projection of the CNAE dataset (see Table 1) into 2
dimensions in both offline and online fashion, where each color represents a label. In
Fig. 1a, we note that UMAP offers the most interesting visualization while
separating classes (9 classes). The overlap in the new space, for instance with tSNE
in Fig. 1b, can potentially affect later classification task, notably distance-based 
algorithms, since properties like global distances and density may be lost. On the 
other hand, linear transformation, such as PCA, cannot discriminate between 
instances which prevents them from being represented in the form of clusters 
(Fig. 1c). To motivate our choice, we project the same dataset using our batch- 


¹ Transductive learning consists in learning on a full given dataset (including instances
with unknown labels), but prediction is only made on the known set of unlabeled instances
from the same dataset. This is achieved by clustering data instances.

incremental strategy (more details in Sect. 3.2). Figure 1d illustrates the change
from the offline UMAP representation; this is not as drastic as the ones engendered
by tSNE and PCA (Figs. 1e and f, respectively), showing their limits in
capturing information from data that arrives in a batch-incremental manner.


3.2 Algorithm Description 


A simple and effective scheme in supervised learning is lazy learning [1].
Since lazy learning approaches are based on distances between every pair of
instances, they unfortunately perform poorly in terms of execution time.
k-Nearest Neighbors (kNN) is a well-known lazy algorithm that does not
require any work during training, so it uses the entire dataset to predict labels for
test instances. However, it is impossible to store an evolving data stream, which
is potentially infinite, or to scan it multiple times, due to its tremendous
volume. To tackle this challenge, a basic incremental version of kNN has been
proposed, which uses a fixed-length window that slides through the stream and
merges newly arriving instances with the closest ones already in the window [23].

To predict the class label for an incoming instance, we take the majority class
label of its nearest neighbors inside the window using a defined distance metric
(Eq. 2). Since we keep the recently arrived instances inside the sliding window for
prediction, the search for the nearest neighbors is still costly in terms of memory
and time [3,7], and high-dimensional streams require further resources.

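Given a window $w$, the distance between $X_i$ and $X_j$ is defined as follows:

$$D_{X_j}(X_i) = \sqrt{\lVert X_i - X_j \rVert^2}. \quad (1)$$

Similarly, the k-Nearest Neighbors distance is defined as follows:

$$D_{w,k}(X_i) = \min_{\binom{w}{k},\, X_j \in w} \sum_{j=1}^{k} D_{X_j}(X_i), \quad (2)$$

where $\binom{w}{k}$ denotes the subset of the kNN of $X_i$ in $w$.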
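
The window-based prediction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `SlidingWindowKNN` and its methods are hypothetical, the window simply drops its oldest instance when full, and the Euclidean distance is assumed as the metric.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Illustrative kNN over a fixed-length sliding window (hypothetical names)."""

    def __init__(self, k=5, window_size=1000):
        self.k = k
        # A deque with maxlen discards the oldest instance once the window is full.
        self.window = deque(maxlen=window_size)

    def learn_one(self, x, y):
        """Store a labeled instance in the window."""
        self.window.append((tuple(x), y))

    def predict_one(self, x):
        """Return the majority label among the k nearest stored instances."""
        if not self.window:
            return None
        # Euclidean distance from x to every instance kept in the window.
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]
```

Every prediction scans the whole window, which is why the cost grows with both the window size and the number of attributes.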

When dealing with high-dimensional data, a pre-processing phase before
applying a learning algorithm is imperative to avoid the curse of
dimensionality from a computational point of view. The latter may increase
resource usage and decrease the performance of some algorithms, such as kNN. The main
idea to mitigate this curse consists of using an efficient strategy with consistent
and promising results, such as UMAP.

Since UMAP is a transductive technique, an instance-incremental learning
approach that includes UMAP does not work, because the entire stream would need to
be processed for each new incoming instance. Such a process
would be costly and would not meet streaming requirements. To reduce the
processing cost while respecting the constraints of the streaming framework,
including the memory constraint and the incremental nature of the
data, we adopt a batch-incremental strategy. In the following, we introduce the
procedure of our novel approach, batch-incremental UMAP-kNN.

Fig. 2. Batch-incremental UMAP-kNN scheme


Step 1: Partition of the Stream. During this step, we assume that data
arrive in batches (or chunks) by dividing the stream into disjoint partitions
$S_1, S_2, \ldots$ of size $s$. The first part of Fig. 2 shows a stream of instances divided
into batches: instead of instances becoming available one at a time, they
arrive simultaneously as groups of instances, $S_1, \ldots, S_q$, where $S_q$ is the $q$th
chunk. A simple example of a data stream is a video sequence, where at each
instant we have a succession of images.
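
As a small illustration of this partitioning step, the generator below cuts any iterable stream into consecutive disjoint chunks. The function name `batches` is hypothetical, and yielding a final shorter chunk when the stream ends is an assumption, since the text only considers chunks of size s.

```python
from itertools import islice


def batches(stream, s):
    """Yield consecutive disjoint chunks S_1, S_2, ... of at most s instances."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, s))
        if not chunk:
            return  # the stream is exhausted
        yield chunk
```

Because the generator pulls instances lazily, it never needs the whole stream in memory, only the chunk being filled.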


Step 2: Data Pre-processing. We aim to construct a low-dimensional $Y_i \in \mathbb{R}^p$
from an infinite stream of high-dimensional data $X_i \in \mathbb{R}^d$, where $p < d$. As
mentioned before, UMAP is unable to compress data incrementally and needs to
transform more than one observation at a time, because it builds a neighborhood
graph on a set of instances and then lays it out in a lower-dimensional space [21].
Thus, our proposed approach operates on batches of the stream, where a single
batch $S_i$ of data is processed at a time $T_i$. The first two steps in Fig. 2 depict the
application of UMAP on the disjoint batches. Once a batch is complete, in
the second step we apply UMAP to it independently of the chunks that
have already been processed, so each $S_i \in \mathbb{R}^d$ is transformed and represented
by $S'_i \in \mathbb{R}^p$. This new representation is very likely free of redundancies
and irrelevant attributes, and is obtained by finding potentially useful non-linear
combinations of existing attributes, i.e. by repacking relevant information of the
larger feature space and encoding it more compactly.

For UMAP to learn when moving from one batch to another, we seed each
chunk’s embedding with the outcome of the previous one, i.e., we match the
initial coordinates of instances in the current embedding to the final coordinates
in the preceding one. This helps to avoid losing the topological information of
the stream and to keep successive embeddings stable as we transition from
one batch to its successor. Afterwards, we use the compressed representation
of the high-dimensional chunk for the next step, which consists in supporting the
incremental kNN classification algorithm.


Step 3: kNN Classification. UMAP-kNN aims to decrease the computational
cost of kNN on high-dimensional data streams by reducing the input space size
with UMAP, in a batch-incremental way. Like
the prediction phase of the kNN algorithm, which is based on the neighborhood²,
UMAP operates on a k-nearest-neighbor graph (topological representation) and
optimizes the low-dimensional representation of the data using gradient descent.
One useful consequence is that UMAP, because of its solid theoretical backing as
a manifold technique, preserves properties such as density and pairwise distances.
Thus, it does not bias the neighborhood-based kNN performance.

This step consists of classifying the evolving data stream, where the learning
task occurs on consecutive batches, i.e. we incrementally train kNN with
instances that become successively available in chunk buffers after pre-processing.
Figure 2 shows the underlying batch-incremental learning scheme, which
is built upon the divide-and-conquer strategy. Since UMAP is applied
independently to each batch, once a chunk is complete and has been transformed into
$\mathbb{R}^p$, we feed the first half of the batch to the sliding window and incrementally
predict the class labels for the second half (the rest of the instances).

Given that kNN is adaptive, the main novelty of UMAP-kNN is in how it
merges the current batch with previous ones. This is done by adding it to the
instances from previous chunks inside the kNN window. Past chunks are
eventually discarded; only some of them are stored and maintained while
the adaptive window slides. The instances kept temporarily inside the
window are then used to define the neighborhood and predict the class
labels of later incoming instances. As presented in Fig. 2, the intuitive idea for
combining results from different batches is to use the first half of each batch for
training and the second half for prediction. In general, successive embeddings can
be very different, and one would expect this to affect
the global performance of our approach. We therefore adopt this scheme to maintain
stability in an adaptive batch-incremental manifold classification approach.
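
The train-on-first-half, predict-on-second-half loop can be sketched as below. This is only an illustration under stated assumptions: `reduce_fn` is a hypothetical stand-in for the per-batch UMAP step (any callable mapping a list of d-dimensional points to p-dimensional points), the function names are invented for this example, and instances from the second half are assumed not to enter the window after prediction, which the text does not specify.

```python
from collections import Counter, deque
import math


def knn_predict(window, x, k):
    """Majority label among the k instances in the window closest to x."""
    nearest = sorted(window, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def run_batch_incremental(batches, reduce_fn, k=5, window_size=1000):
    """Process labeled batches and return (n_correct, n_predicted).

    Each batch is a list of (point, label) pairs. reduce_fn stands in for
    the dimensionality reduction applied independently to every batch.
    """
    window = deque(maxlen=window_size)
    correct = predicted = 0
    for batch in batches:
        points = reduce_fn([x for x, _ in batch])
        labels = [y for _, y in batch]
        half = len(batch) // 2
        # First half: feed the low-dimensional instances to the sliding window.
        for x, y in zip(points[:half], labels[:half]):
            window.append((tuple(x), y))
        # Second half: predict each instance from the current window content.
        for x, y in zip(points[half:], labels[half:]):
            if window:
                predicted += 1
                correct += knn_predict(window, tuple(x), k) == y
    return correct, predicted
```

A real run would pass a reducer built on UMAP; the stand-in keeps the sketch independent of any particular library.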


4 Experiments 


In this section, we present a series of experiments carried out on various datasets
and evaluated on three criteria: accuracy, memory (MB), and time (sec).


4.1 Datasets


We use 3 synthetic and 6 real-world datasets from different scenarios that have
been widely used in the literature to evaluate the performance of data
stream classifiers. Table 1 presents a short description of each dataset, and
further details are provided in what follows.

² The distances between the new incoming instance and the instances already available
inside the adaptive window are computed in order to assign it to a particular class.


Tweets. The dataset was created using the tweets text data generator provided 
by MOA [6] that simulates sentiment analysis on tweets, where messages can 
be classified depending on whether they convey positive or negative feelings. 
Tweets1, Tweets2, and Tweets3 produce instances of 500, 1,000 and 1,500 attributes, respectively.


Har. Human Activity Recognition dataset [2] built from several subjects per- 
forming daily living activities, such as walking, sitting, standing and laying, while 
wearing a waist-mounted smartphone equipped with sensors. The sensor signals 
were pre-processed using noise filters and attributes were normalized. 


CNAE. CNAE is the national classification of economic activities dataset [9]. 
Instances represent descriptions of Brazilian companies categorized into 9 classes. 
The original texts were pre-processed to obtain the current highly sparse data. 


Enron. The Enron corpus dataset is a large set of email messages that was made 
public during the legal investigation concerning the Enron corporation [17]. This 
cleaned version of Enron consists of 1,702 instances and 1,000 attributes. 


Table 1. Overview of the datasets 


Dataset | #Instances | #Attributes | #Classes | Type
Tweets1 | 1,000,000  | 500         | 2        | Synthetic
Tweets2 | 1,000,000  | 1,000       | 2        | Synthetic
Tweets3 | 1,000,000  | 1,500       | 2        | Synthetic
Har     | 10,299     | 561         | 6        | Real
CNAE    | 1,080      | 856         | 9        | Real
Enron   | 1,702      | 1,000       | 2        | Real
IMDB    | 120,919    | 1,001       | 2        | Real
Nomao   | 34,465     | 119         | 2        | Real
Covt    | 581,012    | 54          | 7        | Real


IMDB. IMDB movie reviews dataset was proposed for sentiment analysis [19], 
where each review is encoded as a sequence of word indexes (integers). 


Nomao. Nomao dataset [8] was provided by Nomao Labs where data come from 
several sources on the web about places (name, address, localization, etc.). 


Covt. The forest covertype dataset was obtained from US Forest Service Resource
Information System data, where each class label represents a different cover type.

Fig. 3. (a) Varying the chunk size. (b) Varying the neighborhood size for UMAP.


4.2 Results and Discussions 


We compare our proposed classifier, UMAP-kNN, to various commonly used
baseline methods from the dimensionality reduction and machine learning areas:
PCA [24], tSNE (fixing the perplexity to 30, which is the best value as reported
in [20]), and SAM-kNN (SkNN) [18]. We also use HAT, a classifier with a different
structure based on trees [4], to assess its performance with the neighborhood-based
UMAP. For a fair comparison, we compare the UMAP-kNN
approach with competitors that also use UMAP in the same batch-incremental
manner. Incremental kNN has two crucial parameters: (i) the number
of neighbors k, fixed to 5; and (ii) the window size w, which maintains the
low-dimensional data, fixed to 1000. According to previous studies such as [7], a
bigger window increases resource usage, and a smaller one reduces the
accuracy.

The experiments were conducted on a machine equipped with an Intel Core 
i5 CPU and 4 GB of RAM. All experiments were implemented and evaluated in 
Python by extending the Scikit-multiflow framework³ [22].

Figure 3a depicts the influence of the chunk size on the accuracy of
UMAP-kNN on some datasets. Generally, fixing the chunk size imposes the
following dilemma: choosing a small size, so that we obtain an accurate reflection
of the current data, or choosing a large size, which may increase the accuracy since
more data are available. Ideally, a batch would contain as many
instances as possible, so as to represent the whole stream. In practice, the chunk size
needs to be small enough to fit in main memory, otherwise the running time
of the approach will increase. Since UMAP is relatively slow, we choose small
chunk sizes to overcome this issue with UMAP-kNN. Based on the obtained
results, we fix the chunk size to 400 for the best accuracy-memory trade-off.

We investigate the behavior of a crucial parameter that controls UMAP, the
number of neighbors, via the classification performance of our approach. Based
on the size of the neighborhood, UMAP constructs the manifold and focuses on
preserving local and global structures. Figure 3b shows the accuracy when the
number of neighbors is varied on diverse datasets. We notice that, for all datasets,
the accuracy remains nearly constant with no large differences, e.g. on Har. Since
a large neighborhood leads to a slower learning process, in the following we fix
the neighborhood size to 15.

³ https://scikit-multiflow.github.io/.

Fig. 4. Comparison of UMAP-kNN, tSNE-kNN, PCA-kNN, and kNN (with the entire
datasets) while projecting into 3-dimensions: (a) Accuracy. (b) Memory.

tSNE is a visualization technique, so we are limited to projecting
high-dimensional data into 2 or 3 dimensions. In order to evaluate the performance
of our proposal in a fair comparison against tSNE-kNN and PCA-kNN,
we project data into a 3-dimensional space. Figure 4a illustrates that UMAP-kNN
makes significantly more accurate predictions, consistently beating the
best-performing baselines (tSNE-kNN and PCA-kNN), notably on CNAE and the
Tweets datasets. Figure 4b depicts the amount of memory needed by the three
algorithms, which is practically the same for some datasets. Compared to kNN
that uses the whole data without projection, UMAP-kNN consumes
much less memory while sacrificing a bit of accuracy, because we are
removing many attributes. Figure 4c shows that our approach is consistently
faster than tSNE-kNN, because tSNE computes the distances between every pair
of instances to project. PCA-kNN is a bit faster still, thanks to the simplicity of
PCA. Even with this trade-off, our approach performs well on almost all datasets.

In addition to its good classification performance in comparison with
competitors, the batch-incremental UMAP-kNN did a better job of preserving
density by capturing both global and local structures, as shown in Fig. 1d. The
fact that UMAP and kNN are both neighborhood-based methods is a
key element in achieving good accuracy. UMAP has not only the power of
visualization but also the ability to reduce the dimensionality of data efficiently,
which makes it useful as a pre-processing technique for machine learning.

Table 2 reports the comparison of UMAP-kNN against state-of-the-art
classifiers. We highlight that our approach performs better on almost all datasets. It
achieves similar accuracies to UMAP-SkNN on several datasets but, in terms
of resources, the latter is slower because of its drift detection mechanism.

Table 2. Comparison of UMAP-kNN, PCA-kNN, UMAP-SkNN, and UMAP-HAT.


Dataset | UMAP-kNN | PCA-kNN | UMAP-SkNN | UMAP-HAT
Accuracy (%)
Tweets1 | 75.71    | 69.89   | 75.37     | 66.47
Tweets2 | 75.16    | 69.21   | 74.40     | 61.27
Tweets3 | 71.01    | 70.81   | 70.47     | 66.98
Har     | 75.30    | 70.50   | 64.09     | 84.89
CNAE    | 76.67    | 67.41   | 75.18     | 40.18
Enron   | 92.24    | 93.41   | 91.89     | 91.77
IMDB    | 67.38    | 67.28   | 67.43     | 64.52
Nomao   | 91.92    | 91.13   | 91.63     | 83.75
Covt    | 61.29    | 66.73   | 53.08     | 55.43
Memory (MB)
Tweets1 | 1366.71  | 1354.24 | 1373.15   | 2738.32
Tweets2 | 2530.30  | 2518.76 | 2532.95   | 4891.23
Tweets3 | 3706.99  | 3706.55 | 3722.68   | 7144.77
Har     | 311.58   | 310.48  | 312.84    | 381.49
CNAE    | 254.17   | 246.94  | 260.29    | 262.52
Enron   | 269.00   | 267.31  | 271.56    | 288.74
IMDB    | 3012.85  | 3013.28 | 3018.04   | 7471.64
Nomao   | 289.81   | 285.50  | 290.60    | 508.50
Covt    | 700.69   | 689.97  | 704.46    | 3788.54
Time (Sec)
Tweets1 | 558.56   | 217.44  | 1396.32   | 2163.14
Tweets2 | 616.50   | 350.63  | 908.59    | 3453.21
Tweets3 | 667.43   | 400.62  | 1066.98   | 6273.19
Har     | 75.20    | 24.37   | 77.99     | 82.47
CNAE    | 8.89     | 4.81    | 13.17     | 19.78
Enron   | 12.80    | 9.52    | 17.26     | 32.84
IMDB    | 715.68   | 407.60  | 1038.77   | 4691.07
Nomao   | 248.79   | 20.46   | 327.36    | 228.00
Covt    | 2311.21  | 137.62  | 3756.41   | 2297.01


UMAP-kNN has a better performance than PCA-kNN, e.g. on the Tweets datasets,
at the cost of being slower. We also observe that, on most datasets, UMAP-HAT failed
to outperform our approach (in terms of accuracy, memory, and time), due to the
integration of a neighborhood-based technique (UMAP) into a tree structure (HAT).
Figure 5 reports detailed results for the Tweets1 dataset with five output
dimensions. Figure 5a exhibits the accuracy of our approach, which is consistently above
its competitors while ensuring stability for different manifolds. Figures 5b and 5c
show that kNN-based classifiers use far fewer resources than the tree-based
UMAP-HAT. We see that UMAP-kNN requires less time than UMAP-HAT and
UMAP-SkNN to process the stream, but PCA-kNN is fastest thanks to its
simplicity. Still, the gain in accuracy with UMAP-kNN is more significant.

Fig. 5. Comparison of UMAP-kNN, PCA-kNN, UMAP-SkNN, and UMAP-HAT over
different output dimensions on Tweets1: (a) Accuracy. (b) Memory. (c) Time.


5 Concluding Remarks and Future Work 


In this paper, we presented a novel batch-incremental approach for mining data
streams using the kNN algorithm. UMAP-kNN combines the simplicity of kNN
with the high performance of UMAP, which is used as an internal pre-processing
step to reduce the feature space of data streams. We showed that UMAP is
capable of efficiently embedding data streams within a batch-incremental
strategy, in an extensive evaluation against well-known state-of-the-art algorithms on
various datasets. We further demonstrated that the batch-incremental approach
is just as effective as the offline approach in visualization, and that its accuracy
outperforms reputed baselines while using reasonable resources.

We would like to pursue our promising approach further and enhance its
run-time performance by applying a fast dimension reduction before applying UMAP.
Another area for future work could be the use of a different mechanism, such
as the application of UMAP to each incoming instance inside a sliding window.
We believe that this may be slow but would be suited to instance-incremental
learning.


Abstract. Many graph pattern mining algorithms have been designed
to identify recurring structures in graphs. The main drawback of these 
approaches is that they often extract too many patterns for human anal- 
ysis. Recently, pattern mining methods using the Minimum Description 
Length (MDL) principle have been proposed to select a characteristic 
subset of patterns from transactional, sequential and relational data. In 
this paper, we propose an MDL-based approach for selecting a character- 
istic subset of patterns on labeled graphs. A key notion in this paper is 
the introduction of ports to encode connections between pattern occur- 
rences without any loss of information. Experiments show that the num- 
ber of patterns is drastically reduced. The selected patterns have complex 
shapes and are representative of the data. 


Keywords: Pattern mining · Graph mining · Minimum Description Length


1 Introduction 


Many fields have complex data that need labeled graphs, i.e. graphs where ver- 
tices and edges have labels, for an accurate representation. For instance, in chem- 
istry and biology, molecules are represented as atoms and bonds; in linguistics, 
sentences are represented as words and dependency links; in the semantic web, 
knowledge graphs are represented as entities and relationships. Depending on 
the domain, graph datasets can be made of large graphs or large collections 
of graphs. Graphs are complex to analyze in order to extract knowledge, for 
instance to identify frequent structures in order to make them more intelligible. 

In the field of pattern mining, there have been a number of proposals, namely
graph mining approaches, to extract frequent subgraphs. Classical approaches
to graph mining, e.g. gSpan [12] and Gaston [7], work on collections of graphs,
and generate all patterns w.r.t. a frequency threshold. The major drawback of
this kind of approach is the huge number of generated patterns, which
renders them difficult to analyze. Some approaches such as CloseGraph [13] reduce
the number of patterns by only generating closed patterns. However, the set of
closed patterns generally remains too large, with a lot of redundancy between
patterns. Constraint-based approaches, such as gPrune [14], reduce the
number of extracted patterns by extracting only the patterns following a certain
acceptance rule. These algorithms generally manage to reduce the number of
patterns; however, they also limit their type. Additionally, if the acceptance rule
is user-provided, the user needs some background knowledge on the data.

More effective approaches to reduce the number of patterns are those based 
on the Minimum Description Length (MDL) principle [3]. The MDL principle 
comes from information theory, and states that the model that describes the 
data the best is the one that compresses the data the best. It has been shown 
on sets of items [10], sequences [9] and relations [4] that an MDL-based approach
can select a small and descriptive subset of patterns. Few MDL-based
approaches have been proposed for graphs. SUBDUE [1] iteratively compresses 
a graph by replacing each occurrence of a pattern by a single vertex. At each 
step, the chosen pattern is the one that compresses the most. The drawback of 
SUBDUE is that the replacement of pattern occurrences by vertices entails a loss 
of information. VoG [5] summarizes graphs as a composition of predefined fam- 
ilies of patterns (e.g., paths, stars). Like SUBDUE, VoG aims to only extract 
“interesting” patterns, but instead of evaluating each pattern individually like 
SUBDUE, it evaluates the set of extracted patterns as a whole. This allows the 
algorithm to find a “good set of patterns” instead of a “set of good patterns”. 
One limitation of VoG is that the type of patterns is restricted to predefined 
ones. Another limitation is that VoG works on unlabeled graphs (e.g. network
graphs), while we are interested in labeled graphs.

The contribution of this paper (Sect. 3) is a novel approach called
GRAPHMDL, leveraging the MDL principle to select graph patterns from labeled
graphs. Contrary to SUBDUE, GRAPHMDL ensures that there is no loss of
information thanks to the introduction of the notion of ports associated with graph
patterns. Ports represent how adjacent occurrences of patterns are connected.
We evaluate our approach experimentally (Sect. 4) on two datasets with
different kinds of graphs: one on AIDS-related molecules (few labels, many cycles),
and the other one on dependency trees (many labels, no cycles). Experiments
validate our approach by showing that the data can be significantly compressed,
and that the number of selected patterns is drastically reduced compared to the
number of candidate patterns. Moreover, we observe that the patterns can have
complex and varied shapes, and are representative of the data.


2 Background Knowledge 


2.1 The MDL Principle 


Fig. 1. A labeled undirected simple graph.

Fig. 2. Embeddings of a pattern in the graph of Fig. 1.

Fig. 3. Two singleton patterns.

The Minimum Description Length (MDL) principle [3] is a technique from the
domain of information theory that allows selecting the model, from a family of
models, that best describes some data. The MDL principle states that the best
model $M$ for describing some data $D$ is the one that minimizes the description
length $L(M, D) = L(M) + L(D|M)$, where $L(M)$ is the length of the model and
$L(D|M)$ the length of the data encoded with the model. The MDL principle does
not define how to compute every possible description length. However, common
primitives exist for data and distributions [6]:


- An element $x \in \mathcal{X}$ with uniform distribution has a code of $\log(|\mathcal{X}|)$ bits.
- An element $x \in \mathcal{X}$ appearing $usage(x, D)$ times in some data $D$ has a code
  of $L^{\mathcal{X}}_{usage}(x, D) = -\log\left(\frac{usage(x, D)}{\sum_{x_i \in \mathcal{X}} usage(x_i, D)}\right)$ bits. This encoding is optimal.
- An integer $n \in \mathbb{N}$ without a known upper bound can be encoded with a
  universal integer encoding, whose size in bits is noted $L_{\mathbb{N}}(n)$¹.
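
The three primitives above translate directly into short functions. This is a minimal sketch with hypothetical function names, assuming base-2 logarithms (lengths in bits) and the shifted Elias gamma encoding that the authors mention in their footnote for the universal integer code.

```python
import math


def uniform_code_length(n_elements):
    """Bits to encode one element drawn uniformly from a set of n_elements."""
    return math.log2(n_elements)


def usage_code_length(usage, total_usage):
    """Bits to encode an element used `usage` times out of `total_usage`."""
    return -math.log2(usage / total_usage)


def universal_int_length(n):
    """Bits for a non-negative integer n: 2 * floor(log2(n + 1)) + 1."""
    # (n + 1).bit_length() - 1 equals floor(log2(n + 1)) without float rounding.
    return 2 * ((n + 1).bit_length() - 1) + 1
```

For instance, an element used 3 times out of 6 costs 1 bit, while a rarer element used once out of 6 costs about 2.58 bits: frequently used elements get shorter codes.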


Description lengths of elements that are common to all models are usually 
ignored, since they do not affect their comparison. 

Krimp [10] is a pattern mining algorithm using the MDL principle to select a 
“characteristic” set of itemset patterns from a transactional database. Because of 
its good performance, Krimp has been adapted to other types of data, such as
sequences [9] and relational databases [4]. In our approach we redefine Krimp’s 
key concepts on graphs, in order to apply a Krimp-like approach to graph mining. 


2.2 Graphs and Graph Patterns 


Definition 1. A labeled graph $G = (V, E, l_V, l_E)$ over two label sets $\mathcal{L}_V$ and $\mathcal{L}_E$
is a data structure composed of a set of vertices $V$, a set of edges $E \subseteq V \times V$,
and two labeling functions $l_V \in V \to 2^{\mathcal{L}_V}$ and $l_E \in E \to \mathcal{L}_E$ that associate a
set of labels to vertices, and one label to edges.

$G$ is said to be undirected if $E$ is symmetric, and simple if $E$ is irreflexive.


Although our approach applies to all labeled graphs, in the following we
only consider undirected simple graphs, so as to compare ourselves with existing
tools and benchmarks. Figure 1 shows an example of a graph, with 8 vertices and
7 edges, defined over vertex label set {W, X, Y, Z} and edge label set {a, b}. In
our definition vertices can have several or no labels, unlike usual definitions in
graph mining, because this makes our approach applicable to more datasets.


¹ In our implementation we use Elias gamma encoding [2], shifted by 1 so that it can
encode 0. Therefore $L_{\mathbb{N}}(n) = 2\lfloor \log(n + 1) \rfloor + 1$.
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P G oP. “n T 
Pattern structure Pattern usage Pattern code Pattem code Port count  PortID Portusage  Portcode "O't code 
length (bits) length (bits) 
P1 X aY b Z A PT 1 5 vi 1 LT] 2 
OOO v2 3 0.42 
a vi 1 EAN] 
P 1 P 2.58 2 1 
$ O—O i v2 1 1 
Ww 
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Px | @* 1 Px 2.58 1 vi 1 o 
Fig. 4. Example of a GRAPHMDL code table over the graph of Fig. 1. Pattern and 


port usages, and code lengths have been added as illustration and are not part of the 
table definition. Unused singleton patterns are omitted. 


Definition 2. Let $G^P$ and $G^D$ be graphs. An embedding (or occurrence) of $G^P$
in $G^D$ is an injective function $\varepsilon \in V^P \to V^D$ such that: (1) $l_V^P(v) \subseteq l_V^D(\varepsilon(v))$ for
all $v \in V^P$; (2) $(\varepsilon(u), \varepsilon(v)) \in E^D$ for all $(u, v) \in E^P$; and (3) $l_E^P(e) = l_E^D(\varepsilon(e))$
for all $e \in E^P$.


We define graph patterns as graphs $G^P$ having some occurrences in the data
graph $G^D$. Figure 2 shows the three embeddings $\varepsilon_1, \varepsilon_2, \varepsilon_3$ of a two-vertex graph
pattern into the graph of Fig. 1. We define singleton patterns as the elementary
patterns. A vertex singleton pattern is a graph with one vertex having one label.
An edge singleton pattern is a graph with two unlabeled vertices, connected by
a single labeled edge. Figure 3 shows examples of singleton patterns.
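
The definitions of labeled graphs and embeddings can be illustrated with a small data structure and a check of the three embedding conditions. This is a sketch for undirected simple graphs only; the class `LabeledGraph` and the function `is_embedding` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected simple labeled graph (illustrative representation)."""
    vertex_labels: dict = field(default_factory=dict)  # vertex -> set of labels
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, labels=()):
        self.vertex_labels[v] = set(labels)

    def add_edge(self, u, v, label):
        self.edge_labels[frozenset((u, v))] = label


def is_embedding(pattern, data, mapping):
    """Check whether mapping (pattern vertex -> data vertex) is an embedding."""
    images = [mapping.get(v) for v in pattern.vertex_labels]
    # The mapping must cover every pattern vertex and be injective.
    if None in images or len(set(images)) != len(images):
        return False
    if any(image not in data.vertex_labels for image in images):
        return False
    # (1) Labels of each pattern vertex are included in those of its image.
    for v, labels in pattern.vertex_labels.items():
        if not labels <= data.vertex_labels[mapping[v]]:
            return False
    # (2) and (3) Each pattern edge maps to a data edge carrying the same label.
    for edge, label in pattern.edge_labels.items():
        u, v = tuple(edge)
        if data.edge_labels.get(frozenset((mapping[u], mapping[v]))) != label:
            return False
    return True
```

An edge singleton pattern is then a `LabeledGraph` with two vertices whose label sets are empty, so condition (1) holds for any pair of data vertices and only the edge label decides.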


3 GRAPHMDL: MDL for Graphs 


In this section we present our contribution: the GRAPHMDL approach. This
approach takes as input a graph (the original graph $G^o$) and a set of
patterns extracted from that graph (the candidate patterns), and outputs the most
descriptive subset of candidate patterns according to the MDL principle. The
candidates can be generated with any graph mining algorithm, e.g. gSpan [12].

The intuition behind GRAPHMDL is that since data and patterns are both 
graphs, the data can be seen as a composition of pattern embeddings. Informally, 
we want a user analyzing the output of GRAPHMDL to be able to say “the data 
is composed of one occurrence of pattern A, connected to one occurrence of 
pattern B, which is itself connected to one occurrence of pattern C”. Moreover,
we want the user to be able to tell how these structures are connected together:
which vertices of each pattern are used to connect it to other patterns.


3.1 Model: A Code Table for Graph Patterns 


Similarly to Krimp [10], we define our model as a Code Table (CT), i.e. a set P of 
patterns with associated coding information. A first difference with Krimp is that 
the patterns are graph patterns. A second difference is the need for additional 
coding information: a single code would not suffice since all the information 
related to connectivity between pattern occurrences would be lost. 


58 F. Bariatti et al. 


Fig. 5. How the data graph of Fig. 1 is encoded with the code table of Fig. 4.
(a) Retained occurrences of CT patterns. (b) The rewritten graph. Blue squares are
pattern embeddings (their label indicates the pattern), white circles are port vertices.
Edge labels represent which pattern port corresponds to each port vertex. (Color figure
online)

We therefore introduce the notion of ports in order to represent how pattern 
embeddings connect to each other to form the original graph. The set of ports of 
a pattern is a subset of the vertices of the pattern. Intuitively, a pattern vertex 
is a port if at least one pattern embedding maps this vertex to a vertex in the 
original graph that is also used by another embedding (be it of the same pattern 
or a different one). For example, in Fig. 5a the three occurrences of pattern P1 
are inter-connected through their middle vertex: this vertex is a port. Since port 
information increases the description length, we expect our approach to select 
patterns with few ports. 

Figure 4 shows an example of a CT associated with the graph of Fig. 1. Every
row of the CT is composed of three parts, and contains information about a
pattern $P \in \mathcal{P}$ (e.g. the first row contains information about pattern P1). The
first part of a row is the graph $G^P$, which represents the structure of the pattern
(e.g. P1 is a pattern with three labeled vertices and two labeled edges). The
second part of a row is the code $c_P$ associated with the pattern. The third part
of a row is the description of the port set of the pattern, $\Pi_P$ (e.g. P1 has two
ports, its first two vertices, with codes of 2 and 0.42 bits²). We note $\Pi$ the set of
all ports of all patterns. As in Krimp, the length of the code of a pattern or port
depends on its usage in the encoding of the data, i.e. how many times it is used
to describe the original graph $G^o$ (e.g. P1 has a code of 1 bit because it is used
3 times and the sum of pattern usages in the CT is 6, see Sects. 3.2 and 3.3).
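
As a worked check of the code lengths quoted above, the usage-based formula can be applied by hand. This sketch assumes base-2 logarithms and that port usages are normalized within each pattern; the port usages of 1 and 3 for P1 are the values shown in Fig. 4, and the normalization is inferred from those values rather than stated in this paragraph.

```latex
\begin{align*}
L(c_{P1})  &= -\log_2\left(\tfrac{3}{6}\right) = 1 \text{ bit} \\
L(c_{v_1}) &= -\log_2\left(\tfrac{1}{1 + 3}\right) = 2 \text{ bits} \\
L(c_{v_2}) &= -\log_2\left(\tfrac{3}{1 + 3}\right) \approx 0.42 \text{ bits}
\end{align*}
```

Under the same formula, a pattern used once out of 6 gets a code of $-\log_2(1/6) \approx 2.58$ bits, which matches the other pattern code lengths listed in Fig. 4.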


3.2 Encoding the Data with a Code Table 


The intuition behind GRAPHMDL is that we can represent the original graph $G^o$
(i.e. the data) as a set of pattern occurrences, connected via ports. Encoding the
data with a CT consists in creating a structure that makes explicit which occurrences
are used and how they interconnect to form the original graph. We call this
structure the rewritten graph $G^r$.


² MDL approaches deal with theoretical code lengths, which may not be integers.

Definition 3. A rewritten graph $G^r = (V^r, E^r, l_V^r, l_E^r)$ is a graph where the set
of vertices is $V^r = V^r_{emb} \cup V^r_{port}$: $V^r_{emb}$ is the set of pattern embedding vertices
and $V^r_{port}$ is the set of port vertices. $E^r \subseteq V^r_{emb} \times V^r_{port}$ is the set of edges from
embeddings to ports, and $l_V^r \in V^r_{emb} \to \mathcal{P}$ and $l_E^r \in E^r \to \Pi$ are the labelings.


In order to compute the encoding of the data graph with a given CT, we start
with an empty rewritten graph. One after another, we select patterns from the
CT. For each pattern, we compute the occurrences of its graph $G^P$. Similarly to
Krimp, we limit overlaps between embeddings: we allow overlap on vertices (since it is
the key notion behind ports), but we forbid edge overlaps.
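
The overlap rule can be sketched as a simple filter over candidate embeddings. This is an illustration only: the function name `retain_embeddings` is hypothetical, embeddings are examined in the order given (the order in which patterns and embeddings are considered is not specified in this passage), and coverage of vertex labels is ignored.

```python
def retain_embeddings(pattern_edges, candidate_mappings, covered_edges):
    """Keep the embeddings whose image edges are not covered yet.

    pattern_edges: list of (u, v) pairs of pattern vertices.
    candidate_mappings: iterable of dicts from pattern to data vertices.
    covered_edges: set of frozensets of data vertices, updated in place.
    """
    retained = []
    for mapping in candidate_mappings:
        image = {frozenset((mapping[u], mapping[v])) for u, v in pattern_edges}
        if image & covered_edges:
            continue  # this embedding would reuse an edge, which is forbidden
        covered_edges |= image  # sharing vertices with earlier embeddings is fine
        retained.append(mapping)
    return retained
```

Passing the same `covered_edges` set across successive patterns makes the edge-overlap constraint hold over the whole code table, not just within one pattern.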

Each retained embedding is represented in the rewritten graph by a pattern
embedding vertex: a vertex $v_e \in V^r_{emb}$ with a label $P \in \mathcal{P}$ indicating
which pattern it instantiates. Vertices that are shared by several embeddings
are represented in the rewritten graph by a port vertex $v_p \in V^r_{port}$. We add an
edge $(v_e, v_p) \in E^r$ between the pattern embedding vertex $v_e$ of a pattern $P$ and
the port vertex $v_p$, when the embedding associated with $v_e$ maps the pattern’s
port $v_\pi \in \Pi_P$ to $v_p$. We label this edge $v_\pi$.

We make sure that code tables always include all singleton patterns, so that 
they can always encode any vertex and edge of the original graph. 

Figure 5 shows the graph of Fig. 1 encoded with the CT of Fig. 4. Embeddings 
of CT patterns become pattern embedding vertices in the rewritten graph (blue 
squares). Vertices that are at the boundary between multiple embeddings become 
port vertices in the rewritten graph (white circles). When an embedding has a 
port, its pattern embedding vertex in the rewritten graph is connected to the 
corresponding port vertex and the edge label indicates which pattern’s port it 
is. For instance, the three retained occurrences of pattern P1 all share the same 
vertex labeled Y (middle of the original graph), thus in the rewritten graph the 
three corresponding pattern embedding vertices are connected to the same port 
vertex via port v2. 


3.3 Description Lengths 


In this section we define how to compute the description length of the CT and 
the rewritten graph. Description lengths are used to compare CTs. Formulas are 
explained below and grouped in Fig. 6. 


Code Table. The description length L(M) = L(CT) of a CT is the sum of the 
description lengths of its rows (skipping rows with unused patterns), and every 
row is composed of three parts: the pattern graph structure, the pattern code, 
and the pattern port description. 

To describe the structure $G = G^P$ of a pattern ($L(G)$) we start by encoding
the number of vertices of the pattern. Then we encode the vertices one after
the other. For each vertex $v$, we encode its labels then its adjacent edges. To
encode the vertex labels ($L_V(v, G)$) we specify their number first, then the
labels themselves. To encode the adjacent edges ($L_E(v, G)$) we specify their
number (between 0 and $|V| - 1$ in a simple graph), then for each edge, its destination
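$$
\begin{aligned}
L(c_P) &= L^{\mathcal{P}}_{usage}(P, G^r)
  \quad \text{where } usage(P_i, G^r) = |\{v_e \in V^r_{emb} \mid l^r_V(v_e) = P_i\}| \\
L(c_\pi, P) &= L^{\Pi_P}_{usage}(\pi, G^r)
  \quad \text{where } usage(\pi_i, G^r) = |\{e \in E^r \mid l^r_E(e) = \pi_i\}| \\
L(M) = L(CT) &= \sum_{\substack{P \in \mathcal{P} \\ usage(P) \neq 0}}
  \Big[ \underbrace{L(G^P)}_{\text{structure}}
      + \underbrace{L(c_P)}_{\text{code}}
      + \underbrace{L(\Pi_P)}_{\text{ports}} \Big] \\
L(G) &= \underbrace{L_{\mathbb{N}}(|V|)}_{\text{vertex count}}
  + \sum_{v \in V} \Big[ \underbrace{L_V(v, G)}_{\text{vertex labels}}
      + \underbrace{L_E(v, G)}_{\text{edges of vertex}} \Big] \\
L_V(v, G) &= \underbrace{L_{\mathbb{N}}(|l_V(v)|)}_{\text{label count}}
  + \sum_{l \in l_V(v)} \underbrace{L^{\mathcal{L}_V}_{usage}(l, G^o)}_{\text{label code}} \\
L_E(v, G) &= \underbrace{\log(|V|)}_{\text{edge count}}
  + \sum_{(v, w) \in E \mid v < w}
    \Big[ \underbrace{\log(|V|)}_{\text{destination}}
        + \underbrace{L^{\mathcal{L}_E}_{usage}(l_E(v, w), G^o)}_{\text{label}} \Big] \\
L(\Pi_P) &= \underbrace{\log(|V| + 1)}_{\text{port count}}
  + \underbrace{\log \binom{|V|}{|\Pi_P|}}_{\text{port ids}}
  + \sum_{\pi \in \Pi_P} \underbrace{L(c_\pi, P)}_{\text{port code}} \\
L(D|M) = L(G^r) &= \underbrace{L_{\mathbb{N}}(|V^r_{port}|)}_{\text{port vertex count}}
  + \sum_{v \in V^r_{emb}} L_{emb}(v, P, G^r)
  \quad \text{with } P = l^r_V(v) \\
L_{emb}(v, P, G^r) &= \underbrace{L(c_P)}_{\text{pattern code}}
  + \underbrace{\log(|\Pi_P| + 1)}_{\text{edge count}}
  + \sum_{\substack{(v, w) \in E^r \\ \pi = l^r_E(v, w)}}
    \Big[ \underbrace{\log(|V^r_{port}|)}_{\text{port vertex id}}
        + \underbrace{L(c_\pi, P)}_{\text{port code}} \Big]
\end{aligned}
$$

Fig. 6. Formulas used for computing description lengths. The structure
$G^P = (V^P, E^P, l^P_V, l^P_E)$ is shortened to $G = (V, E, l_V, l_E)$ for ease of
reading.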


vertex and its label. To avoid encoding the same edge twice, in undirected
graphs we encode each edge with its endpoint that has the smallest identifier.
Vertex and edge labels are encoded based on their relative usage in the original
graph $G^o$ ($L^{\mathcal{L}_V}_{usage}(l, G^o)$ and
$L^{\mathcal{L}_E}_{usage}(l_E(v, w), G^o)$). Since this encoding does not
change between CTs, it is a meaningful way to compare them.

The second element of a CT row is the code cp associated to the pattern 
(L(cp)). This code is based on the usage of the pattern in the rewritten graph. 
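As a small worked example, a usage-based code length can be computed as the negative base-2 logarithm of a relative usage. The sketch below assumes this Krimp-style Shannon code, which agrees with the example given earlier in the text (a pattern used 3 times out of a total usage of 6 gets a 1-bit code). The function name is hypothetical.

```python
import math


def usage_code_length(usage: int, total_usage: int) -> float:
    """Length in bits of a code based on relative usage.

    Assumption: the code length is -log2(usage / total_usage), as in
    Krimp-style MDL approaches. Lengths are theoretical and need not be integers.
    """
    if usage <= 0 or usage > total_usage:
        raise ValueError("usage must be in the range 1..total_usage")
    return -math.log2(usage / total_usage)


# Pattern used 3 times, sum of all pattern usages in the CT is 6.
print(usage_code_length(3, 6))  # 1.0
```

Frequently used patterns thus receive short codes, while rarely used ones receive long codes.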

The last element of a CT row is the description of the pattern’s ports
($L(\Pi_P)$). First, we encode the number of ports of the pattern (between 0 and
$|V|$). Then we specify which vertices are ports: if there are $k$ ports, then
there are $\binom{|V|}{k}$ possibilities. Finally, we encode the port codes
($L(c_\pi, P)$): their code is based on the usage of the port in the rewritten
graph w.r.t. other ports of the pattern.
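The three parts of the port description can be added up as in the following sketch. It assumes base-2 logarithms, a port code of the form -log2(usage / total usage of the pattern's ports), and that every listed port is used at least once; the function and parameter names are hypothetical.

```python
import math


def port_description_length(n_pattern_vertices: int, port_usages: list) -> float:
    """Bits needed to describe the ports of one pattern.

    n_pattern_vertices: |V|, number of vertices of the pattern.
    port_usages: usage of each port in the rewritten graph (one entry per port).
    """
    k = len(port_usages)
    # Port count: a value between 0 and |V|, hence |V| + 1 possibilities.
    count_bits = math.log2(n_pattern_vertices + 1)
    # Port ids: which k vertices among |V| are ports.
    ids_bits = math.log2(math.comb(n_pattern_vertices, k))
    # Port codes: based on usage relative to the other ports of the pattern.
    total = sum(port_usages)
    code_bits = sum(-math.log2(u / total) for u in port_usages)
    return count_bits + ids_bits + code_bits


# A 2-vertex pattern with a single port used 3 times.
bits = port_description_length(2, [3])
```

For this toy pattern the result is log2(3) + log2(2) + 0 bits: the single port needs no code because there is no other port to distinguish it from.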


Rewritten Graph. The rewritten graph has two types of vertices: port vertices
and pattern embedding vertices. Port vertices do not have any associated
information, so we just need to encode their number. The description
length $L(D|M) = L(G^r)$ of the rewritten graph is the length needed for encoding
the number of port vertices plus the sum of the description lengths
$L_{emb}(v, P, G^r)$ of the pattern embedding vertices $v$. Every pattern
embedding vertex has a label $l^r_V(v)$ specifying its pattern $P$, encoded with
the code $c_P$ of the pattern. We then encode the number of edges of the vertex,
i.e. the number of ports of this embedding in particular (between 0 and
$|\Pi_P|$). Then for each edge we encode the port vertex to which it is connected
and to which port it corresponds (using the port code $c_\pi$).
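The description length of a single pattern embedding vertex can be sketched as follows. The code assumes base-2 logarithms and takes the pattern code length and the port code lengths as precomputed inputs; the function and parameter names are hypothetical, and the term for the number of port vertices is deliberately left out.

```python
import math


def embedding_description_length(pattern_code_bits: float,
                                 n_pattern_ports: int,
                                 port_code_bits_per_edge: list,
                                 n_port_vertices: int) -> float:
    """Bits needed to describe one pattern embedding vertex.

    pattern_code_bits: L(c_P), code length of the embedding's pattern.
    n_pattern_ports: |Pi_P|, number of ports of the pattern.
    port_code_bits_per_edge: L(c_pi, P) for each edge of this embedding vertex.
    n_port_vertices: number of port vertices in the rewritten graph.
    """
    bits = pattern_code_bits
    # Edge count: between 0 and |Pi_P|, hence |Pi_P| + 1 possibilities.
    bits += math.log2(n_pattern_ports + 1)
    for port_code_bits in port_code_bits_per_edge:
        # Each edge names a port vertex and the pattern port it corresponds to.
        bits += math.log2(n_port_vertices) + port_code_bits
    return bits


# A pattern with a 1-bit code and 3 ports, two of which are used by this
# embedding (1 bit each), in a rewritten graph with 4 port vertices.
bits = embedding_description_length(1.0, 3, [1.0, 1.0], 4)  # 9.0
```

Summing this quantity over all pattern embedding vertices, and adding the encoding of the number of port vertices, gives the description length of the rewritten graph.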


Table 1. Characteristics of the datasets used in the experiments


Dataset     Graph count   |V|     |E|     |L_V|   |L_E|
AIDS-CA     423           16714   17854   21      3
AIDS-CM     1082          34387   37033   26      3
UD-PUD-En   1000          21176   20176   17      46


3.4 The GRAPHMDL Algorithm 


In previous subsections we presented the different MDL definitions that
GRAPHMDL uses to evaluate pattern sets (CTs). A naive algorithm for finding the
most descriptive pattern set (in the MDL sense) could be to create a CT for every
possible subset of candidates and retain the one yielding the smallest
description length. However, such an approach is often infeasible because of the
large number of possible subsets. That is why GRAPHMDL applies a greedy heuristic
algorithm, adapting the Krimp algorithm [10] to our MDL definitions.

Like Krimp, our algorithm starts with a CT composed of all singletons, which
we call $CT_0$. One after the other, candidates are added to the CT if they
lower the description length. Two heuristics guide GRAPHMDL: the candidate
order and the order of patterns in the CT. We use the same heuristics as Krimp,
with the difference that we define the size of a pattern as its total number of
labels (vertices and edges). We also implement Krimp’s “post-acceptance
pruning”: after a pattern is accepted in the CT, GRAPHMDL checks whether the
removal of some patterns from the CT lowers the description length $L(M, D)$.
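The greedy loop can be sketched as follows. This is a simplified illustration and not the authors' implementation: the description length is passed in as a hypothetical callable, candidates are assumed to be already sorted in candidate order, the order of patterns inside the CT is ignored, and pruning naively tries every non-singleton pattern instead of using Krimp's usage-based criterion. The toy description length at the end is invented purely to exercise the loop.

```python
def greedy_code_table(singletons, candidates, description_length):
    """Krimp-style greedy selection with a simplified post-acceptance pruning."""
    code_table = list(singletons)  # CT_0: all singleton patterns
    best_length = description_length(code_table)
    for candidate in candidates:   # assumed pre-sorted in candidate order
        trial = code_table + [candidate]
        trial_length = description_length(trial)
        if trial_length >= best_length:
            continue               # candidate rejected
        code_table, best_length = trial, trial_length
        # Post-acceptance pruning: drop patterns whose removal helps.
        for pattern in list(code_table):
            if pattern in singletons or pattern == candidate:
                continue           # singletons always stay in the CT
            pruned = [p for p in code_table if p != pattern]
            pruned_length = description_length(pruned)
            if pruned_length < best_length:
                code_table, best_length = pruned, pruned_length
    return code_table, best_length


# Toy description length (an assumption for illustration only): each
# non-singleton pattern costs 2 bits, each of the 6 data elements that no
# chosen pattern covers costs 3 bits.
COVERS = {"A": {1, 2, 3}, "B": {1, 2, 3, 4, 5}, "C": {6}}


def toy_length(code_table):
    chosen = [p for p in code_table if p in COVERS]
    covered = set().union(*(COVERS[p] for p in chosen))
    return 2 * len(chosen) + 3 * (6 - len(covered))


table, length = greedy_code_table(["s"], ["A", "B", "C"], toy_length)
```

In this toy run, pattern A is accepted first, then pruned once the larger pattern B makes it redundant; the final table is `["s", "B", "C"]` with a length of 4.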


4 Experimental Evaluation 


In order to evaluate our proposal, we developed a prototype of GRAPHMDL.
The prototype was developed in Java 1.8 and is available as a git repository³.


4.1 Datasets 


The first two datasets that we use, AIDS-CA and AIDS-CM, are part of the
National Cancer Institute AIDS antiviral screen data⁴. They are collections of
graphs often used to compare graph mining algorithms [11]. Graphs of this
collection represent molecules: vertices are atoms and edges are bonds. We
stripped all hydrogen atoms from the molecules, since their positions can be
inferred. We took our third dataset, UD-PUD-En, from the Universal Dependencies
project⁵. This project curates a collection of trees describing dependency


3 https://gitlab.inria.fr/fbariatt/graphmdl.
4 https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data.
5 https://universaldependencies.org/.
Table 2. Experimental results for different candidate sets


Dataset     gSpan     Candidate   Runtime   |CT|   L%       Median        Median
            support   count                                 label count   port count
AIDS-CA     20%       2194        19m       115    24.42%   9             3
AIDS-CA     15%       7867        1h47m     123    21.64%   10            4
AIDS-CA     10%       20596       3h36m     148    19.03%   11            3
AIDS-CM     20%       433         22m       111    28.91%   7             4
AIDS-CM     15%       779         32m       131    27.44%   9             4
AIDS-CM     10%       2054        1h10m     163    24.94%   9             4
AIDS-CM     5%        9943        5h02m     225    20.43%   9             4
UD-PUD-En   10%       164         1m        162    39.55%   5             2
UD-PUD-En   5%        458         3m        249    34.45%   5             2
UD-PUD-En   1%        6021        19m       523    28.14%   6             2
UD-PUD-En   0%        233434      9h57m     773    26.25%   7             2


relationships between words of sentences of multiple corpora in multiple lan- 
guages. We used the trees corresponding to the English version of the PUD 
corpus. 

Table 1 presents the main characteristics of the three datasets that we use:
the number of elementary graphs in the dataset, the total number of vertices,
the total number of edges, the number of different vertex labels, and the number
of different edge labels. Since GRAPHMDL works on a single graph instead of a
collection, we aggregate collections into a single graph with multiple connected
components when needed. We generate candidate patterns by using a gSpan
implementation available on its author’s website⁶.
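Aggregating a collection into a single graph amounts to a disjoint union, which can be sketched as below. The graph representation is a hypothetical one chosen for brevity: each graph is a pair of a vertex-label dictionary (vertex ids assumed to be 0..n-1) and a list of labeled edges.

```python
def merge_graphs(graphs):
    """Disjoint union of a collection of graphs into one multi-component graph.

    graphs: list of (vertex_labels, edges) pairs, where vertex_labels maps
    local vertex ids 0..n-1 to label sets and edges is a list of
    (source, target, label) triples (hypothetical representation).
    """
    merged_labels = {}
    merged_edges = []
    offset = 0
    for vertex_labels, edges in graphs:
        # Shift local ids so that vertices of different graphs never collide.
        for vertex, labels in vertex_labels.items():
            merged_labels[vertex + offset] = labels
        for source, target, label in edges:
            merged_edges.append((source + offset, target + offset, label))
        offset += len(vertex_labels)
    return merged_labels, merged_edges


g1 = ({0: {"C"}, 1: {"O"}}, [(0, 1, "double")])
g2 = ({0: {"N"}, 1: {"C"}}, [(0, 1, "single")])
labels, edges = merge_graphs([g1, g2])
```

Since no edge is added between the shifted components, each original graph remains a separate connected component of the result.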


4.2 Quantitative Evaluation 


Table 2 presents the results of the first experiment. For instance, the first
line shows that we ran GRAPHMDL on the AIDS-CA dataset, using as candidates the
2194 patterns generated by gSpan for a support threshold of 20%. It took 19 min
for our approach to select a CT composed of 115 patterns, yielding a description
length that is 24% of the description length obtained by the singleton-only CT.
Selected patterns have a median of 9 labels and 3 ports.

We observe that the number of patterns of a CT is often significantly smaller
than the number of candidates. This is particularly remarkable for experiments
run with small support thresholds, where GRAPHMDL reduces the number of
patterns up to 300 times: patterns generated for these support thresholds
probably contain a lot of redundancy, which GRAPHMDL avoids.

We also note that the description lengths of the CTs found by GRAPHMDL
are between 20% and 40% of the lengths of the baseline code tables $CT_0$, which
shows that our algorithm succeeds in finding regularities in the data. Description


6 https://sites.cs.ucsb.edu/~xyan/software/gSpan.htm.




lengths are smaller when the number of candidates is higher: this may be because
with more candidates, there are more chances of finding “good” candidates that
further reduce description lengths.
[Figure panels: “With GraphMDL” (left), “With SUBDUE” (right)]


Fig. 7. How GRAPHMDL (left) and SUBDUE (right) encode one of the AIDS-CM graphs.


We observe that GRAPHMDL can find patterns of non-trivial size, as shown 
by the median label count in Table 2. Also, most patterns have few ports, which 
shows that GRAPHMDL manages to find models in which the original graph is 
described as a set of components without many connections between them. We 
think that a human will interpret such a model with more ease, as opposed to a 
model composed of “entangled” components. 


4.3 Qualitative Evaluations 


Interpretation of Rewritten Graphs. Figure 7 shows how GRAPHMDL uses pat- 
terns selected on the AIDS-CM dataset to encode one of the graphs of the 
dataset (more results are available in our git repository). It illustrates the key 
idea behind our approach: find a set of patterns so that each one describes part 
of the data, and connect their occurrences via ports to describe the whole data. 

We observe that GRAPHMDL selects bigger patterns (such as P2), describ- 
ing big chunks of data, as well as smaller patterns (such as P3, edge singleton), 
that can form bridges between pattern occurrences. Big patterns increase the 
description length of the CT, but describe more of the data in a single occur- 
rence, whereas small patterns do the opposite. Following the MDL principle, 
GRAPHMDL finds a good balance between the two types of patterns. 

It is interesting to note that pattern P1 in Fig. 7 corresponds to the carboxylic 
acid functional group, common in organic chemistry. GRAPHMDL selected this 
pattern without any prior knowledge of chemistry, solely by using MDL. 
Comparison with SUBDUE. On the right of Fig. 7 we can observe the encoding
found by SUBDUE on the same graph. The main disadvantage of SUBDUE is
information loss: we can see that the data is composed of two occurrences of
pattern P1, but not how these two occurrences are connected. Thanks to the
notion of ports, GRAPHMDL does not suffer from this problem: the user can
know exactly which atoms lie at the boundary of each pattern occurrence.


Table 3. Classification accuracies. Results of methods marked with * are from [8].


Algorithm          AIDS-CA/CI     Mutag          PTC-MR         PTC-FR
Baseline-Largest   50.01 ± 0.03   66.50 ± 0.00   55.80 ± 0.00   65.50 ± 0.00
GRAPHMDL           71.61 ± 0.96   80.79 ± 1.51   57.38 ± 1.68   62.70 ± 1.86
WL*                N/A            87.26 ± 1.42   63.12 ± 1.44   67.64 ± 0.74
P-WL-C*            N/A            90.51 ± 1.34   64.02 ± 0.82   67.15 ± 1.09
RetGK*             N/A            90.30 ± 1.10   62.15 ± 1.60   67.80 ± 1.10

Assessing Patterns Through Classification. We showed in the previous
experiments that GRAPHMDL manages to reduce the number of patterns, and that the
introduction of ports allows for a precise analysis of graphs. We now ask
whether the extracted patterns are characteristic of the data. To evaluate this
aspect, we adopt the classification approach used by Krimp [10]. We apply
GRAPHMDL independently on each class of a multi-class dataset, and then use the
resulting CTs to classify each graph: we encode it with each of the CTs, and
assign it to the class whose CT yields the smallest description length
$L(D|M)$. Since GRAPHMDL is not designed with the goal of classification in
mind, we would expect existing classifiers to outperform GRAPHMDL. In
particular, note that patterns are selected on each class independently of other
classes. Indeed, GRAPHMDL follows a descriptive approach whereas classifiers
generally follow a discriminative approach. Table 3 presents the results of this
new experiment. We compare GRAPHMDL with graph classification algorithms found
in the literature [8], and a baseline that classifies all graphs as belonging to
the largest class. The AIDS-CA/CI dataset is composed of the CA class of the
AIDS dataset and a same-size same-labels random sample from the CI class
(corresponding to negative examples). The other datasets⁷ are from [8]. We
performed a 10-fold validation repeated 10 times and report average
accuracies and standard deviations.
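The classification rule itself is short enough to sketch. In the code below, each class is represented by a hypothetical callable returning the description length $L(D|M)$ of a graph under that class's CT; the toy encoders, which charge a fixed number of bits per edge label, are invented for illustration and do not reproduce the GRAPHMDL encoding.

```python
def classify(graph, class_encoders):
    """Assign a graph to the class whose code table encodes it most compactly.

    class_encoders: dict mapping a class name to a function that returns the
    description length L(D|M) of the graph under that class's CT.
    """
    lengths = {label: encode(graph) for label, encode in class_encoders.items()}
    return min(lengths, key=lengths.get), lengths


def toy_encoder(bits_per_label, default_bits=5.0):
    """Hypothetical stand-in for a class CT: fixed cost per edge label."""
    return lambda graph: sum(bits_per_label.get(e, default_bits) for e in graph)


encoders = {
    "active": toy_encoder({"C-N": 1.0, "C-O": 3.0}),
    "inactive": toy_encoder({"C-N": 3.0, "C-O": 1.0}),
}
predicted, lengths = classify(["C-N", "C-N", "C-O"], encoders)
```

Returning the per-class lengths alongside the prediction makes it possible to inspect how close the decision was.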

GRAPHMDL clearly outperforms the baseline on two datasets, AIDS and
Mutag, but is only comparable to the baseline for the PTC datasets. On Mutag,
GRAPHMDL is less accurate than other classifiers but closer to them than to
the baseline. On the PTC datasets, we hypothesize that the learned descriptions
are not discriminative w.r.t. the chosen classes, although they are characteristic
enough to reduce description length. Nonetheless, results are still better than
random guessing (accuracy would be 50%). An interesting point of GRAPHMDL


7 For concision, we do not report on PTC-{MM,FM}; they yield similar results.
classification is that it is explainable: the user can look at how the patterns of
the two classes encode a graph (similarly to Fig. 7) and understand why one class 
is chosen over another. 


5 Conclusion 


In this paper, we have proposed GRAPHMDL, an MDL-based pattern mining
approach to select a representative set of graph patterns on labeled graphs. We
proposed MDL definitions that make it possible to compute the description
lengths necessary to apply the MDL principle. The originality of our approach
lies in the notion of ports, which guarantee that the original graph can be
perfectly reconstructed, i.e., without any loss of information. Our experiments
show that GRAPHMDL significantly reduces the number of patterns w.r.t. complete
approaches. Further, the selected patterns can have complex shapes with simple
connections. The introduction of the notion of ports facilitates interpretation
w.r.t. SUBDUE. We plan to apply our approach to more complex graphs, e.g.
knowledge graphs.
Abstract. With the availability of user-generated content on the Web,
malicious users have at their disposal huge repositories of private (and often
sensitive) information regarding a large part of the world’s population. The
self-disclosure of personal information, in the form of text, pictures and
videos, exposes the authors of such contents (and not only them) to many
criminal acts such as identity theft, stalking, burglary, fraud, and so
on. In this paper, we propose a way to evaluate the harmfulness of any
form of content by defining a new data mining task called content sensitivity
analysis. According to our definition, a score can be assigned to any
object (text, picture, video...) according to its degree of sensitivity. Even
though the task is similar to sentiment analysis, we show that it has its
own peculiarities and may lead to a new branch of research. Thanks to
some preliminary experiments, we show that content sensitivity analysis
cannot be addressed as a simple binary classification task.


Keywords: Privacy · Text mining · Text categorization


1 Introduction 


Internet privacy has gained much attention in the last decade due to the suc- 
cess of online social networks and other social media services that expose our 
lives to the wide public. In addition to personal and behavioral data collected 
more or less legitimately by companies and organizations, many websites and 
mobile/web applications store and publish tons of user-generated content in the 
form of text posts and comments, pictures and videos which, very often, capture 
and represent private moments of our life. The availability of user-generated
content is a huge source of relatively easy-to-access private (and often very
sensitive) information concerning habits, preferences, families and friends,
hobbies, health and philosophy of life, which exposes the authors of such
contents (or any other individual referenced by them) to many (cyber)criminal
risks, including identity theft, stalking, burglary, fraud, cyberbullying or
“simply” discrimination in the workplace or in life in general. Sometimes users
are not aware of the dangers due to the uncontrolled diffusion of their
sensitive information and would probably avoid publishing it if only someone
told them how harmful it could be.

In this paper, we address exactly this problem by proposing a way to measure 
the degree of sensitivity of any type of user-generated content. To this purpose, 
© The Author(s) 2020 
we define a new data mining task that we call content sensitivity analysis (CSA),
inspired by sentiment analysis [13]. The goal of CSA is to assign a score to any
object (text, picture, video...) according to the amount of sensitive information it
potentially discloses. The problem of private content analysis has already been
investigated as a way to characterize anonymous vs. non-anonymous content
posting in specific social media [5,15,16] or question-and-answer platforms [14].
However, the link between anonymity and sensitive contents is not that obvious:
users may post anonymously because, for instance, they are referring to illegal
matters (e.g., software/streaming piracy, black market and so on); conversely,
fully identifiable persons may post very sensitive contents simply because they
are underestimating the visibility of their action [18,19]. Although CSA has
some points in common with anonymous content analysis and the well-known
sentiment analysis task, we show that it has its own peculiarities and may lead
to a brand new branch of research, opening many intriguing challenges in several
computer science and linguistics fields.

Through some preliminary but extensive experiments on a large annotated
corpus of social media posts, we show that content sensitivity analysis cannot
be addressed straightforwardly. In particular, we design a simplified CSA task
leveraging binary classification to distinguish between sensitive and
non-sensitive posts by testing several bag-of-words and word embedding models.
According to our experiments, the classification performance achieved by the
most accurate models is far from satisfactory. This suggests that content
sensitivity analysis should consider more complex linguistic and semantic
aspects, as well as more sophisticated machine learning models.

The remainder of the paper is organized as follows: Sect. 2 reports a short
analysis of the related scientific literature; Sect. 3 provides the definition of
content sensitivity analysis and presents some challenging aspects of this new
task together with some hints for possible solutions; the preliminary experiments
are reported and discussed in Sect. 4; finally, Sect. 5 concludes by also presenting
some open problems and suggestions for future research.


2 Related Work 


With the success of online social networks and content sharing platforms,
understanding and measuring the exposure of user privacy on the Web has become
crucial [11,12]. Thus, many different metrics and methods have been proposed
with the goal of assessing the risk of privacy leakage in posting activities [1,23].
Most research efforts, however, focus on measuring the overall exposure of users
according to their privacy settings [8,19] or position within the network [18].
Very few research works address the problem of measuring the sensitivity of
user-generated content, and those that do adopt different definitions of
sensitivity. In [5], for instance, the authors define sensitivity of a social media
post as the extent to which users think the post should be anonymous. Then, 
they try to understand the nature of content posted anonymously and analyze 
the differences between content posted on anonymous (e.g., Whisper) and non- 
anonymous (e.g., Twitter) social media sites. They also find significant linguistic 
differences between anonymous and non-anonymous content. A similar approach
has been applied on posts collected from a famous question-and-answer website
[14]. The authors of this work identify categories of questions for which users are
more likely to exercise anonymity and analyze different machine learning models
to predict whether a particular answer will be written anonymously. They also
show that post sensitivity should be viewed as a nuanced measure rather than as
a binary concept. In [2], the authors propose a ranking-based method for
assessing the privacy risk emerging from textual contents related to sensitive
topics, such as depression. They use latent topic models to capture the
background knowledge of a hypothetical rational adversary who aims at targeting
the most exposed users. Additionally, the results are exploited to inform and
alert users about their risk of being targeted.

Similarly to sentiment analysis [13], valuable linguistic resources are needed 
to identify sensitive content in texts. To the best of our knowledge, the only 
works addressing this issue are [6,22], where the authors leverage prototype the- 
ory and traditional theoretical approaches to describe and evaluate a dictionary 
of privacy designed for content analysis. Dictionary categories are evaluated 
according to privacy-related categories from an existing content analysis tool, 
using a variety of text corpora. 

The problem of sensitive content detection has been investigated as a pattern
recognition problem in images as well. In [25], the authors leverage massive
social images and their privacy settings to learn the object-privacy correlation
and identify categories of privacy-sensitive objects automatically. To increase the
accuracy and speed of the classifier, they propose a deep multi-task learning
architecture that learns more representative deep convolutional neural networks
and a more discriminative tree classifier. Additionally, they use the outcomes of
such a model to identify the most suitable privacy settings and/or blur sensitive
objects automatically. This framework is further improved in [24], where the
authors add a clustering-based approach to also incorporate the trustworthiness
of the users who are granted access to the images in the prediction model.

Contrary to the above-mentioned works, in this paper we formally define 
the general task of content sensitivity analysis independently from the type of 
data to be analyzed. Additionally, we provide some suggestions for improving 
the accuracy of the results and show experimentally that the task is challenging, 
and deserves further investigation and greater research efforts. 


3 Content Sensitivity Analysis 


In this section, we introduce the new data mining task that we call content 
sensitivity analysis (CSA), aimed at determining the amount of privacy-sensitive 
content expressed in any user-generated content. We first distinguish two cases, 
namely basic CSA and continuous CSA, according to the outcome of the analysis 
(binary or continuous). Then, we identify a set of subtasks and discuss their 
theoretical and technical details. Before introducing the technical details of CSA, 
we briefly provide the intuition behind CSA by describing a motivating example. 
3.1 Motivating Example


To explain the main objectives of CSA and the scientific challenges associated to 
them, we consider the example in Fig. 1. To decide whether (and to what extent) 
the sentence is sensitive, an inference algorithm should be able to answer the 
following questions: 


1. Subjects: whose information is going to be disclosed? 

2. Information types: does the post refer to any potentially sensitive infor- 
mation type? 

3. Terms: does the post mention any sensitive term? 

4. Topics: does the post mention any sensitive topic?

5. Relations: does the sensitive information refer to any of the subjects?


[Figure: screenshot of a social media post composer containing the text “Now at
the General Hospital with my friend Alice Green for our first course of chemo!”
with the status “feeling hopeful”.]


Fig. 1. An example of a potentially privacy-sensitive post. 


By observing the post in Fig. 1, it is clear that: the post discloses information
about the author and his friend Alice Green (1); the post contains
spatiotemporal references (“now” and “General Hospital”), which are generally
considered intrinsically sensitive (2); the post mentions “chemo”, a potentially
sensitive term (3); the sentence is related to “cancer”, a potentially sensitive
topic (4); the sentence structure suggests that the two subjects of disclosure
have cancer and they are both about to start their first course of
chemotherapy (5).

It is clear that reducing sensitivity to anonymity, as done in previous research
work [5,14], is only one side of the coin. Instead, CSA has much more in common
with the famous sentiment analysis (SA) task, where the objective is to measure
the “polarity” or “sentiment” of a given text [7,13]. However, while SA
already has a well-established theory and can count on a set of easy-to-access
and easy-to-use tools, CSA has never been defined before. Therefore, apart from
the known open problems in SA (such as sarcasm detection), CSA involves three
new scientific challenges:
1. Definition of sensitivity. A clear definition of sensitivity is required.
Sensitivity is often defined in legal systems, such as in the EU General Data
Protection Regulation (GDPR), as a characteristic of some personal data
(e.g., criminal or medical records), but a cognitive and perceptive
explanation of what can be defined as “sensitive” is still missing [22].

2. Sensitivity-annotated corpora. Large text corpora need to be annotated 
according to sensitivity and at multiple levels: at the sentence level (“I got 
cancer” is more sensitive than “I got some nice volleyball shorts”), at the 
topic level (“health” is more sensitive than “sports”) and at the term level 
(“cancer” is more sensitive than “shorts” ). 

3. Context-aware sensitivity. Due to its subjectivity, a clear evaluation of 
the context is needed. The fact that a medical doctor talks about cancer is 
not sensitive per se, but if she talks about some of her patients having cancer, 
she could disclose very sensitive information. 


In the following, we will provide the formal definitions concerning CSA and 
provide some preliminary ideas on how to address the problem. 


3.2 Definitions 


Here, we provide the details regarding the formal framework of content sensitivity 
analysis. To this purpose, we consider generic user-generated contents, without 
specifying their nature (whether textual, visual or audiovisual). We will propose 
a definition of “sensitivity” further in this section. The simplest way to define 
CSA is as follows: 


Definition 1 (basic content sensitivity analysis). Given a user-generated
object $o_i \in \mathcal{O}$, with $\mathcal{O}$ being the domain of all
user-generated contents, the basic content sensitivity analysis task consists in
designing a function $f_s : \mathcal{O} \to \{sens, na, ns\}$, such that
$f_s(o_i) = sens$ iff $o_i$ is privacy-sensitive, $f_s(o_i) = ns$ iff $o_i$ is
not sensitive, otherwise $f_s(o_i) = na$.


The na value is required since the assignment of a correct sensitivity value 
could be problematic when dealing with controversial contents or borderline 
topics. In some cases, assessing the sensitivity of a content object is simply 
impossible without some additional knowledge, i.e., the conversation a post is 
part of, the identity of the author of a post, and so on. In addition, sensitivity is 
not the same for all sensitive objects: a post dealing with health is certainly more 
sensitive than a post dealing with vacations, although both can be considered 
as sensitive. This suggests that, instead of considering sensitivity as a binary 
feature of a text, a more appropriate definition of CSA should take into account 
different degrees of sensitivity, as follows: 


Definition 2 (continuous content sensitivity analysis). Let $o_i \in \mathcal{O}$
be a user-generated object, with $\mathcal{O}$ being the domain of all
user-generated contents. The continuous content sensitivity analysis task
consists in designing a function $f_s : \mathcal{O} \to [-1, 1]$, such that
$f_s(o_i) = 1$ iff $o_i$ is maximally privacy-sensitive, $f_s(o_i) = -1$ iff
$o_i$ is minimally privacy-sensitive, $f_s(o_i) = 0$ iff $o_i$ has unknown
sensitivity. The value $\sigma_i = f_s(o_i)$ is the sensitivity score of object
$o_i$.


According to this definition, sensitive objects have $0 < \sigma_i \le 1$, while
non-sensitive posts have $-1 \le \sigma_i < 0$. In general, when
$\sigma_i \approx 0$ the sensitivity of an object cannot be assessed
confidently. Of course, by setting appropriate thresholds, a
continuous CSA can be easily turned into a basic CSA task.
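The thresholding step can be illustrated with a few lines of Python. The threshold values below (-0.2 and 0.2) are arbitrary placeholders chosen for the example, not values proposed in the paper, and the function name is hypothetical.

```python
def to_basic_csa(score: float, low: float = -0.2, high: float = 0.2) -> str:
    """Map a continuous sensitivity score in [-1, 1] to {sens, na, ns}.

    low and high are hypothetical thresholds: scores between them are too
    close to 0 to be assessed confidently and are mapped to "na".
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("sensitivity scores must lie in [-1, 1]")
    if score > high:
        return "sens"
    if score < low:
        return "ns"
    return "na"


print([to_basic_csa(s) for s in (0.9, 0.05, -0.7)])  # ['sens', 'na', 'ns']
```

Widening the interval between the two thresholds makes the classifier more conservative, at the price of leaving more objects unassessed.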

At this point, a congruent definition of “sensitivity” is required to set up the 
task correctly. Although different characterizations of privacy-sensitivity exist, 
there is no consistent and uniform theory [22]; so, in this work, we consider a more 
generic, flexible and application-driven definition of privacy-sensitive content. 


Definition 3 (privacy-sensitive content). A generic user-generated content 
object is privacy-sensitive if it makes the majority of users feel uncomfortable 
in writing or reading it because it may reveal some aspects of their own or others’ 
private life to unintended people. 


Notice that “uncomfortableness” should not be guided by some moral or
ethical judgement about the disclosed fact, but uniquely by its harmfulness
towards privacy. Such a definition allows the adoption of the “wisdom of the
crowd” principle in contexts where providing an objective definition of what is
sensitive (and what is not sensitive) is particularly hard. Moreover, it also
has an intuitive justification. Different social media may have different
notions of sensitivity. For instance, in a professional social networking site,
revealing details about one’s own job is not only tolerated, but also
encouraged, while one may want to hide detailed information about her
professional life in a generic photo-video sharing platform. Similarly, in a
closed message board (or group), one may decide to disclose more private
information than in open ones. Sensitivity towards certain topics also varies
from country to country. As a consequence, function $f_s$ can be learnt
according to an annotated corpus of content objects as follows.


Definition 4 (sensitivity function learning). Let $O = \{(o_i, \sigma_i)\}_{i=1}^{N}$ be a set
of $N$ annotated objects $o_i \in \mathcal{O}$ with the related sensitivity score $\sigma_i \in [-1, 1]$.
The goal of a sensitivity function learning algorithm is to search for a function
$f_s : \mathcal{O} \to [-1,1]$, such that $\sum_{i=1}^{N} (f_s(o_i) - \sigma_i)^2$ is minimum.
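
To make the objective of Definition 4 concrete, the sketch below fits a one-feature linear model by ordinary least squares and clips its output to [-1, 1]. Everything specific here is an assumption made for illustration: the cue-word feature, the toy posts and their scores are invented, and the paper does not prescribe this model.

```python
# Hypothetical cue words; the feature is the share of first-person tokens.
FIRST_PERSON = {"i", "me", "my", "myself"}


def feature(text):
    tokens = text.lower().split()
    return sum(t in FIRST_PERSON for t in tokens) / max(len(tokens), 1)


def squared_loss(f_s, annotated):
    """Sum over the annotated set of (f_s(o_i) - sigma_i)^2."""
    return sum((f_s(o) - sigma) ** 2 for o, sigma in annotated)


def fit(annotated):
    """Closed-form least squares for one feature, clipped to [-1, 1]."""
    xs = [feature(o) for o, _ in annotated]
    ys = [sigma for _, sigma in annotated]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x

    def f_s(o):
        return max(-1.0, min(1.0, slope * feature(o) + intercept))

    return f_s


# Toy annotated objects (invented for this example).
toy = [
    ("I told my doctor about my illness", 0.9),
    ("The weather is nice today", -0.8),
    ("me and my family moved", 0.6),
    ("New phone released this week", -0.7),
]
f_s = fit(toy)
print(squared_loss(f_s, toy), squared_loss(lambda o: 0.0, toy))
```

The fitted function never does worse on the training objective than the constant "unknown" predictor $f_s \equiv 0$, since the latter is itself a linear model and clipping towards targets in [-1, 1] cannot increase the error.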


The simplest way to address this problem is by setting up a regression (or
classification, in the case of basic CSA) task. However, we will show in Sect. 4 that
such an approach is unable to capture the actual manifold of sensitivity accu- 
rately. Hence, in the following sections, we present a fine-grained definition of 
CSA together with a list of open subproblems related to CSA and provide some 
hints on how to address them. 


3.3 Fine-Grained Content Sensitivity Analysis 


In the previous section, we have considered contents as monolithic objects with 
a sensitivity score associated to them. However, in general, any user-generated 
content object (text, video, picture) may contain both privacy-sensitive and
non-sensitive elements. For instance, a long text post (or video) may deal
with some non-sensitive topic but the author may insert some references to her or
his private life. Similarly, a user may post a picture of her own desk deemed to 
be anonymous but some elements may disclose very private information (e.g., 
the presence of train tickets, drug paraphernalia, someone else’s photo and so 
on). Moreover, the same object (or some of its elements) may violate the privacy 
of multiple subjects, including the author and other people mentioned in the 
corpus, in different ways. For all these reasons, here we propose a fine-grained
definition of content sensitivity analysis. The definition is as follows: 


Definition 5 (fine-grained content sensitivity analysis). Let $o_i \in \mathcal{O}$ be a
user-generated content object. Let $E_i = \{e_j^i\}_{j=1}^{m_i} \subseteq \mathcal{E}$ be a set of $m_i \ge 1$ elements
(or components) that constitutes the object $o_i$, with $\mathcal{E}$ being the domain of all
possible elements. Let $P_i = \{p_k^i\}_{k=1}^{n_i} \subseteq \mathcal{P}$ be the set of $n_i \ge 1$ persons (or subjects)
mentioned in $o_i$, with $\mathcal{P}$ being the domain of all subjects. The fine-grained content
sensitivity analysis task consists in designing a function $f_s : \mathcal{E} \times \mathcal{P} \to [-1,1]$,
such that $f_s(e_j^i, p_k^i) = 1$ iff $e_j^i$ is maximally privacy-sensitive for subject $p_k^i$,
$f_s(e_j^i, p_k^i) = -1$ iff $e_j^i$ is minimally privacy-sensitive for subject $p_k^i$, $f_s(e_j^i, p_k^i) = 0$
iff $e_j^i$ has unknown sensitivity for subject $p_k^i$. The value $\sigma_{jk}^i = f_s(e_j^i, p_k^i)$ is the
sensitivity score of element $e_j^i$ towards subject $p_k^i$.


Notice that $|E_i| \ge 1$ since each object contains at least one element (when
$|E_i| = 1$, the only element $e_1^i$ corresponds to the object $o_i$ itself). Similarly, $|P_i| \ge 1$
because each object has at least the author as subject. In the example reported
in Fig. 1, the post contains only one element (there is only one sentence) and 
concerns two subjects (the author and Alice Green). According to Definition 5 
(and to what we said in Sect. 3.1), the sensitivity score of the post towards both 
the author and Alice Green will be high. 


3.4 Challenges and Possible Solutions 


Fine-grained content sensitivity analysis presents many scientific and technical 
challenges, and may benefit from the cross-fertilization of computational linguistics,
machine learning and semantic analysis. Addressing the problem of connecting 
sensitivity to specific subjects in texts requires the solution of many NLP tasks 
such as named entity recognition, relation extraction [21], and coreference res- 
olution [4]. Additionally, concept extraction and topic modeling are important 
to understand whether a given text deals with sensitive content. To this pur- 
pose, privacy dictionaries [22] could provide a valid support for tagging certain 
topics/terms as sensitive or non-sensitive. Sentiment analysis and emotion detec- 
tion could also reveal private personality traits if related to contents associated 
to certain topics, persons or categories of persons. Furthermore, elements in a
sentence cannot simply be considered as separate entities: the connections
between different parts of a text play an important role in determining the
correct fine-grained sensitivity. It is clear that such a complex problem requires the
availability of massive annotated text corpora and the design of robust machine
learning algorithms to cope with the sparsity of the feature space. All these con- 
siderations apply to the case of visual and audiovisual content as well, but, in 
addition, the intrinsic difficulty of handling multimedia data makes the above 
mentioned challenge even harder and more computationally expensive. 

In the next section, we will show how the basic content sensitivity analysis 
settings can be modeled as a binary classification problem on text data using 
different approaches with limited or moderate success, thus showing the necessity
of a more systematic and in-depth investigation of the problem. 


4 Preliminary Experiments 


In this section, we report the results of some preliminary experiments aimed at 
showing the feasibility of content sensitivity analysis together with its difficulties. 
The experiments are conducted under the basic CSA framework (see Definition 1 
in Sect. 3), with the only difference that we do not consider the “na” class. We
set up a binary classification task to distinguish whether a given input text is 
privacy-sensitive or not. Before presenting the results, in the following, we first 
introduce the data, then we provide the details of our experimental protocol. 


4.1 Annotated Corpus 


Since all previous attempts at identifying sensitive text have leveraged user
anonymity as a discriminant for sensitive content [5,14], there is no reliable 
annotated corpus that we can use as benchmark. Hence, we construct our own 
dataset by leveraging a crowdsourcing experiment. We use one of the datasets 
described in [3], consisting of 9917 anonymized social media posts, mostly writ- 
ten in English, with a minimum length of 2 characters and a maximum length 
of 435 (the average length is 80). Thus, they are representative of typical short
social media posts. On the other hand, they are not annotated for the specific purpose
of our experiment and, because of their shortness, they are also very difficult to
analyze. Consequently, after discarding all useless posts (mostly incomprehensible
ones), we set up a crowdsourcing experiment by using a Telegram bot
that, for each post, asks whether it is sensitive or not. As a third option, it was
also possible to select “unable to decide”. We collected the annotations of 829
posts from 14 distinct annotators. For each annotated post, we retain the most
frequently chosen annotation. Overall, 449 posts were tagged as non-sensitive,
230 as sensitive, and 150 as undecidable. Thus, the final dataset consists of 679 posts
of the first two categories (we discarded all 150 undecidable posts).
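
The aggregation rule described above (keep the most frequently chosen annotation, then drop undecidable posts) can be sketched as follows. The label strings, the toy votes, and in particular the tie-breaking rule (ties are treated as undecidable) are assumptions of this sketch; the text does not say how ties were resolved.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label among a non-empty list of votes.

    Ties are mapped to "undecidable"; this tie rule is an assumption.
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "undecidable"
    return ranked[0][0]


def build_dataset(annotations):
    """Keep only posts whose majority label is sensitive or non-sensitive."""
    labelled = {post: majority_label(v) for post, v in annotations.items()}
    return {p: lab for p, lab in labelled.items() if lab != "undecidable"}


# Toy annotations (invented): post id -> labels chosen by annotators.
toy_annotations = {
    "post-1": ["sensitive", "sensitive", "non-sensitive"],
    "post-2": ["non-sensitive", "non-sensitive"],
    "post-3": ["undecidable", "undecidable", "sensitive"],
    "post-4": ["sensitive", "non-sensitive"],
}
print(build_dataset(toy_annotations))
```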


4.2 Datasets 


We consider two kinds of document representation for the dataset: a bag-of-words
model and four word vector models. To obtain the bag-of-words representation, we
perform the following steps. First, we remove all punctuation characters from the terms
contained in the input posts, as well as short terms (fewer than two characters) and
terms containing digits. Then, we build the bag-of-words model with all remaining
2584 terms, weighted by their tf-idf score. Unlike classic text mining
approaches, we deliberately exclude lemmatization, stemming and stop word 
removal from text preprocessing, since those common steps would affect content 
sensitivity analysis negatively. Indeed, inflections (removed by lemmatization 
and stemming) and stop words (like “me”, “myself”) are important to decide 
whether a sentence reproduces some personal thoughts or private action/status. 
Hereinafter, the bag-of-words representation is referred to as BW2584. 
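
The preprocessing steps above can be sketched in a few lines. This is an illustration only: the exact tf-idf variant is not specified in the text, so the formula used here (relative term frequency times log(N/df)) is an assumption, as are the function names. Note that no stemming, lemmatization or stop word removal is applied, in line with the choice described above.

```python
import math
import string
from collections import Counter

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(post):
    """Strip punctuation, then drop terms shorter than two characters
    and terms containing digits."""
    terms = post.translate(_PUNCT).split()
    return [t for t in terms if len(t) >= 2 and not any(c.isdigit() for c in t)]


def tfidf(posts):
    """Return one {term: weight} dictionary per post (assumed tf-idf variant)."""
    docs = [tokenize(p) for p in posts]
    doc_freq = Counter(t for d in docs for t in set(d))
    n_docs = len(docs)
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append(
            {t: (c / len(d)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


print(tokenize("Me, myself & I: 2 cats, a4 paper!"))
print(tfidf(["me too", "me again"]))
```

A post whose terms are all filtered out yields an empty dictionary, which corresponds to the empty representations that are removed from the dataset later on.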

The word vector representation, instead, is built using word vectors pre- 
trained with two billion tweets (corresponding to 42 billion tokens) using the 
GloVe (Global Vector) model [17]. We use this word embedding method as it 
consistently outperforms both continuous bag-of-words and skip-gram model 
architectures of word2vec [10]. In detail, we use three representations, here called
WV25, WV50 and WV100, with 25, 50 and 100 dimensions, respectively¹. Additionally,
we build an ensemble by considering the concatenation of the three
vector spaces. The latter representation is named WVEns. 

Finally, from all five datasets we removed all posts having an empty bag-of-words
or word vector representation. This preprocessing step further reduces
the size of the dataset down to 611 posts (221 sensitive and 390 non sensitive), 
but allows for a fair performance comparison. 


4.3 Experimental Settings 


Each dataset obtained as described above is given as input to a set of six
classifiers. In detail, we use k-NN, decision tree (DT), Multi-layer Perceptron
(MLP), SVM, Random Forest (RF), and Gradient Boosted Trees (GBT). We do
not execute any systematic parameter selection procedure since our main goal is
not to compare the performance of classifiers but, rather, to show the overall
level of accuracy that can be achieved in a basic content sensitivity analysis task.
Hence, we use the following default parameters for each classifier:


— kNN: we set k = 3 in all experiments; 

— DT: for all datasets, we use C4.5 with Gini Index as split criterion, allowing a 
minimum of two records per node and minimum description length as pruning 
strategy; 

— MLP: we train a shallow neural network with one hidden layer; the number 
of neurons of the hidden layer is 30 for the bag-of-words representation and 
20 for all word vector representations; 

— SVM: for all datasets, we use the polynomial kernel with default parameters; 

— RF: we train 100 models with Gini index as splitting criterion in all experi- 
ments; 

— GBT: for all datasets, we use 100 models with 0.1 as learning rate and 4 as 
maximum tree depth. 


1 Pre-trained vectors are available at https://nlp.stanford.edu/projects/glove/. 
All experiments are conducted by performing ten-fold cross-validation, using,
for each iteration, nine folds as training set and the remaining fold as test set. 
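
The ten-fold protocol can be sketched as below for the 611 posts of the final dataset. The shuffling seed and the absence of stratification are assumptions of this sketch; the text does not state whether folds were stratified by class.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(611))
print([len(test) for _, test in splits])
```

With 611 posts, one fold holds 62 posts and the other nine hold 61 each.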


4.4 Results and Discussion 


The summary of the results, in terms of average F1-score, is reported in Table 1.
It is worth noting that the scores are, in general, very low (between 0.5826,
obtained by the neural network on the bag-of-words model, and 0.6858, obtained
by Random Forest on the word vector representation with 50 dimensions). Of
course, these results are biased by the fact that the data are moderately unbalanced
(64% of posts fall in the non-sensitive class). However, they are not completely
negative, meaning that there is room for improvement. We observe that the
winning model-classifier pair (50-dimensional word vectors processed with Random
Forest) exhibits high recall on the non-sensitive class (0.928) and rather similar
results in terms of precision for the two classes (0.671 and 0.688 for the sensitive
and non-sensitive classes, respectively). The real negative result is the low recall
on the sensitive class (only 0.258), due to the high number of false negatives². We
recall that the number of annotated sensitive posts is only 221, i.e., the number
of examples is not sufficiently large for training a prediction model accurately.
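
As a worked check of how these figures relate, the sketch below computes per-class precision and recall from a confusion matrix. The four counts are not reported in the paper: they are back-derived here from the stated class sizes (221 sensitive, 390 non-sensitive) and the stated rates, and should be read as an assumption.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall of one class from its confusion counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Assumed counts, back-derived from the reported rates (not from the paper).
tp, fn = 57, 164   # sensitive posts: correctly found / missed
tn, fp = 362, 28   # non-sensitive posts: correctly found / flagged as sensitive

prec_s, rec_s = precision_recall(tp, fp, fn)
# For the non-sensitive class the roles of the error types swap.
prec_n, rec_n = precision_recall(tn, fn, fp)
# Micro-averaged F1 over both classes coincides with accuracy.
micro_f1 = (tp + tn) / (tp + tn + fp + fn)

print(round(prec_s, 3), round(rec_s, 3))
print(round(prec_n, 3), round(rec_n, 3))
print(round(micro_f1, 4))
```

Under these assumed counts, the micro-averaged F1-score (equivalently, the accuracy) rounds to the 0.6858 reported for Random Forest on WV50. The text does not state which averaging was used, so this agreement is only suggestive.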


Table 1. Classification in terms of average F1-score for different post representations. 


Dataset | Type         | kNN    | DT     | MLP    | SVM    | RF     | GBT
BW2584  | bag-of-words | 0.6579 | 0.6743 | 0.5826 | 0.6481 | 0.6776 | 0.6678
WV25    | word vector  | 0.6203 | 0.6317 | 0.6497 | 0.6383 | 0.6628 | 0.6268
WV50    | word vector  | 0.6121 | 0.6105 | 0.6530 | 0.6448 | 0.6858 | 0.6399
WV100   | word vector  | 0.6367 | 0.6088 | 0.6497 | 0.6563 | 0.6694 | 0.6497
WVEns   | word vector  | 0.6432 | 0.5859 | 0.6481 | 0.6547 | 0.6628 | 0.6416


These results highlight the following issues and perspectives. First, nega- 
tive (or not-so-positive) results are certainly due to the lack of annotated data 
(especially for the sensitive class). Sparsity is certainly a problem in our set- 
tings. Hence, a larger annotated corpus is needed, although this objective is not 
trivial. In fact, private posts are often difficult to obtain, because social media 
platforms (luckily, somehow) do not allow users to get them using their API. 
As a consequence, all previous attempts to guess the sensitivity of text or con- 
struct privacy dictionaries strongly leverage user anonymity in public post shar- 
ing activities [5,14], or rely on focus groups and surveys [22]. Moreover, without 
a sufficiently large corpus, not even the application of otherwise successful deep 
learning techniques (e.g., RNNs for sentiment analysis [9]) would produce valid 
results. Second, simple classifiers, even when applied to rather complex and rich 
representations, cannot capture the manifold of privacy sensitivity accurately.


2 Due to space limitations, we do not report detailed precision/recall results. 
So, more complex and heterogeneous models should be considered. Probably, an
accurate content sensitivity analysis tool should consider lexical, semantic as
well as grammatical features. Topics are certainly important, but sentence con- 
struction and lexical choices are also fundamental. Therefore, reliable solutions 
would consist of a combination of computational linguistic techniques, machine 
learning algorithms and semantic analysis. Third, the success of picture and
video sharing platforms (such as Instagram and TikTok) implies that any successful
content sensitivity analysis tool should be able to cope with audiovisual
content and, in general, with multimodal/multimedia objects (an open problem
in sentiment analysis as well [20]). Finally, provided that a taxonomy of privacy
categories in everyday life exists (e.g., health, location, politics, religious belief, 
family, relationships, and so on) a more complex CSA setting might consider, 
for a given content object, the privacy sensitivity degree in each category. 


5 Conclusions 


In this paper, we have addressed the problem of determining whether a given 
content object is privacy-sensitive or not by defining the generic task of content 
sensitivity analysis (CSA). Then, we have formulated it at increasing levels of
complexity of the problem setting. Although the task promises to be challenging,
we have shown that it is not infeasible by presenting a simplified formulation
of CSA based on text categorization. With some preliminary but extensive
experiments, we have shown that, no matter the data representation, the accuracy of
such classifiers cannot be considered satisfactory. Thus, it is worth investigating
more complex techniques borrowed from machine learning, computational
linguistics and semantic analysis. Moreover, without a strong effort in building
massive and reliable annotated corpora, the performance of any CSA tool would
be barely sufficient, no matter the complexity of the learning model.


Acknowledgments. The authors would like to thank Daniele Scanu for implementing 
the Telegram bot used by the annotators. This work is supported by Fondazione CRT 
(grant number 2019-0450). 


References 


1. Alemany, J., del Val Noguera, E., Alberola, J.M., Garcia-Fornes, A.: Metrics for pri- 
vacy assessment when sharing information in online social networks. IEEE Access 
7, 143631-143645 (2019) 

2. Biega, J.A., Gummadi, K.P., Mele, I., Milchevski, D., Tryfonopoulos, C., Weikum, 
G.: R-Susceptibility: an IR-centric approach to assessing privacy risks for users in 
online communities. In: Proceedings of ACM SIGIR 2016, pp. 365-374 (2016) 

3. Celli, F., Pianesi, F., Stillwell, D., Kosinski, M.: Workshop on computational per- 
sonality recognition: shared task. In: Proceedings of ICWSM 2013 (2013) 

4. Clark, K., Manning, C.D.: Improving coreference resolution by learning entity-level 
distributed representations. In: Proceedings of ACL 2016 (2016) 


5. Correa, D., Silva, L.A., Mondal, M., Benevenuto, F., Gummadi, K.P.: The many
shades of anonymity: characterizing anonymous social media content. In: Proceedings
of ICWSM 2015, pp. 71-80 (2015)

6. Gill, A.J., Vasalou, A., Papoutsi, C., Joinson, A.N.: Privacy dictionary: a linguistic
taxonomy of privacy for content analysis. In: Proceedings of ACM CHI 2011, pp.
3227-3236 (2011)

7. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Aggarwal,
C.C., Zhai, C. (eds.) Mining Text Data, pp. 415-463. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-1-4614-3223-4_13

8. Liu, K., Terzi, E.: A framework for computing the privacy scores of users in online
social networks. TKDD 5(1), 6:1-6:30 (2010)

9. Ma, Y., Peng, H., Khan, T., Cambria, E., Hussain, A.: Sentic LSTM: a hybrid
network for targeted aspect-based sentiment analysis. Cogn. Comput. 10(4),
639-650 (2018). https://doi.org/10.1007/s12559-018-9549-x

10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed
representations of words and phrases and their compositionality. In: Proceedings of NIPS
2013, pp. 3111-3119 (2013)

11. Oukemeni, S., Rifa-Pous, H., i Puig, J.M.M.: IPAM: information privacy assessment
metric in microblogging online social networks. IEEE Access 7,
114817-114836 (2019)

12. Oukemeni, S., Rifa-Pous, H., i Puig, J.M.M.: Privacy analysis on microblogging
online social networks: a survey. ACM Comput. Surv. 52(3), 60:1-60:36 (2019)

13. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf.
Retrieval 2(1-2), 1-135 (2007)

14. Peddinti, S.T., Korolova, A., Bursztein, E., Sampemane, G.: Cloak and swagger:
understanding data sensitivity through the lens of user anonymity. In: Proceedings
of IEEE SP 2014, pp. 493-508 (2014)

15. Peddinti, S.T., Ross, K.W., Cappos, J.: Finding sensitive accounts on Twitter:
an automated approach based on follower anonymity. In: Proceedings of ICWSM
2016, pp. 655-658 (2016)

16. Peddinti, S.T., Ross, K.W., Cappos, J.: User anonymity on Twitter. IEEE Secur.
Privacy 15(3), 84-87 (2017)

17. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word
representation. In: Proceedings of EMNLP 2014, pp. 1532-1543 (2014)

18. Pensa, R.G., di Blasi, G., Bioglio, L.: Network-aware privacy risk estimation in
online social networks. Soc. Netw. Analys. Mining 9(1), 15:1-15:15 (2019)

19. Pensa, R.G., Blasi, G.D.: A privacy self-assessment framework for online social
networks. Expert Syst. Appl. 86, 18-31 (2017)

20. Poria, S., Majumder, N., Hazarika, D., Cambria, E., Gelbukh, A.F., Hussain, A.:
Multimodal sentiment analysis: addressing key issues and setting up the baselines.
IEEE Intell. Syst. 33(6), 17-25 (2018)

21. Surdeanu, M., McClosky, D., Smith, M., Gusev, A., Manning, C.D.: Customizing
an information extraction system to a new domain. In: Proceedings of
RELMS@ACL 2011, pp. 2-10 (2011)

22. Vasalou, A., Gill, A.J., Mazanderani, F., Papoutsi, C., Joinson, A.N.: Privacy
dictionary: a new resource for the automated content analysis of privacy. JASIST
62(11), 2095-2105 (2011)

23. Wagner, I., Eckhoff, D.: Technical privacy metrics: a systematic survey. ACM
Comput. Surv. 51(3), 57:1-57:38 (2018)




24. Yu, J., Kuang, Z., Zhang, B., Zhang, W., Lin, D., Fan, J.: Leveraging content 
sensitiveness and user trustworthiness to recommend fine-grained privacy settings 
for social image sharing. IEEE Trans. Inf. Forensics Secur. 13(5), 1317-1332 (2018) 

25. Yu, J., Zhang, B., Kuang, Z., Lin, D., Fan, J.: iPrivacy: image privacy protec- 
tion by identifying sensitive objects via deep multi-task learning. IEEE Trans. Inf. 
Forensics Secur. 12(5), 1005-1016 (2017) 


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the 
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the chapter’s Creative Commons license and 
your intended use is not permitted by statutory regulation or exceeds the permitted 
use, you will need to obtain permission directly from the copyright holder. 




Gibbs Sampling Subjectively 
Interesting Tiles 


Anes Bendimerad¹(✉), Jefrey Lijffijt², Marc Plantevit³, Céline Robardet¹,
and Tijl De Bie²


1 Univ Lyon, INSA, CNRS UMR 5205, 69621 Villeurbanne, France 
ahmed-anes.bendimerad@insa-lyon.fr 
2 IDLab, ELIS Department, Ghent University, Ghent, Belgium 
3 Univ Lyon, UCBL, CNRS UMR 5205, 69621 Lyon, France 


Abstract. The local pattern mining literature has long struggled with 
the so-called pattern explosion problem: the size of the set of patterns 
found exceeds the size of the original data. This causes computational 
problems (enumerating a large set of patterns will inevitably take a sub- 
stantial amount of time) as well as problems for interpretation and usabil- 
ity (trawling through a large set of patterns is often impractical). 

Two complementary research lines aim to address this problem. The 
first aims to develop better measures of interestingness, in order to reduce 
the number of uninteresting patterns that are returned [6,10]. The sec- 
ond aims to avoid an exhaustive enumeration of all ‘interesting’ patterns 
(where interestingness is quantified in a more traditional way, e.g. fre- 
quency), by directly sampling from this set in a way that more ‘interest- 
ing’ patterns are sampled with higher probability [2]. 

Unfortunately, the first research line does not reduce computational 
cost, while the second may miss out on the most interesting patterns. 
In this paper, we combine the best of both worlds for mining inter- 
esting tiles [8] from binary databases. Specifically, we propose a new 
pattern sampling approach based on Gibbs sampling, where the probability
of sampling a pattern is proportional to its subjective interestingness [6],
an interestingness measure reported to better represent true
interestingness.

The experimental evaluation confirms the theory, but also reveals an 
important weakness of the proposed approach which we speculate is 
shared with any other pattern sampling approach. We thus conclude 
with a broader discussion of this issue, and a forward look. 


Keywords: Pattern mining · Subjective interestingness · Pattern
sampling · Gibbs sampling


1 Introduction 


Pattern mining methods aim to select elements from a given language that bring 
to the user “implicit, previously unknown, and potentially useful information 
from data” [7]. To meet the challenge of selecting the appropriate patterns for
a user, several lines of work have been explored: (1) Many constraints on
measures that assess the quality of a pattern using exclusively the data have
been designed [4,12,13]; (2) Preference measures have been considered to only
retrieve patterns that are non-dominated in the dataset; (3) Active learning
systems have been proposed that interact with the user to elicit her interest
in the patterns and guide the exploration toward those she is interested in; (4)
Subjective interestingness measures [6,10] have been introduced that aim to take 
into account the implicit knowledge of a user by modeling her prior knowledge 
and retrieving the patterns that are unlikely according to the background model. 

The shift from threshold-constraints on objective measures toward the use of 
subjective measures provides an elegant solution to the so-called pattern explo- 
sion problem by considerably reducing the output to only truly interesting pat- 
terns. Unfortunately, the discovery of subjectively interesting patterns with exact 
algorithms remains computationally challenging. 

In this paper, we explore another strategy: pattern sampling. The
aim is to reduce the computational cost while identifying the most important 
patterns, and allowing for distributed computations. There are two families of 
local pattern sampling techniques. 

The first family uses Metropolis-Hastings [9], a Markov Chain Monte Carlo
(MCMC) method. It performs a random walk over a transition graph representing
the probability of reaching a pattern given the current one. This can be done
with the guarantee that the distribution of the considered quality measure over
the sample set is proportional to its distribution over the whole pattern set [1].
However, each proposed move of the random walk is accepted only with a probability
equal to the acceptance rate α. This can be very small, which may result in a
prohibitively slow convergence rate. Moreover, in each iteration the part of the
transition graph representing the probability of reaching patterns given the current
one has to be materialized in both directions, further raising the computational cost.
Other approaches [5,11] relax this constraint but lose the guarantee.

Methods in the second family are referred to as direct pattern sampling 
approaches [2,3]. A notable example is [2], where a two-step procedure is pro- 
posed that samples frequent itemsets without simulating stochastic processes. In 
a first step, it randomly selects a row according to a first distribution, and from 
this row, draws a subset of items according to another distribution. The combi- 
nation of both steps follows the desired distribution. Generalizing this approach 
to other pattern domains and quality measures appeared to be difficult. 

In this paper, we propose a new pattern sampling approach based on Gibbs
sampling, where the probability of sampling a pattern is proportional to its
Subjective Interestingness (SI) [6]. Gibbs sampling (described in Sect. 3) is
a special case of Metropolis-Hastings where the acceptance rate α is always
equal to 1. In Sect. 4, we show how the random walk can be simulated without
materializing any part of the transition graph, except the currently sampled
pattern. While we present this approach specifically for mining tiles in rectangular
databases, applying it to other pattern languages can be achieved relatively
easily. The experimental evaluation (Sect. 5) confirms the theory, but also
reveals a weakness of the proposed approach which we speculate is shared by
other direct pattern sampling approaches. We thus conclude with a broader
discussion of this issue (Sect. 6), and a forward look (Sect. 7).


2 Problem Formulation 


2.1 Notation 


Input Dataset. A dataset $D$ is a Boolean matrix with $m$ rows and $n$ columns.
For $i \in [1,m]$ and $j \in [1,n]$, $D(i,j) \in \{0,1\}$ denotes the value of the cell
corresponding to the $i$-th row and the $j$-th column. For a given set of rows
$I \subseteq [1,m]$, we define the support function $supp_C(I)$ that gives all the columns
having a value of 1 in all the rows of $I$, i.e.,
$supp_C(I) = \{j \in [1,n] \mid \forall i \in I : D(i,j) = 1\}$. Similarly, for a set of columns
$J \subseteq [1,n]$, we define the function
$supp_R(J) = \{i \in [1,m] \mid \forall j \in J : D(i,j) = 1\}$. Table 1 shows a toy example
of a Boolean matrix, where for $I = \{4,5,6\}$ we have that $supp_C(I) = \{2,3,4\}$.


Table 1. Example of a binary dataset D.

# | 1 | 2 | 3 | 4 | 5
1 | 0 | 1 | 0 | 1 | 0
2 | 0 | 1 | 1 | 0 | 0
3 | 1 | 0 | 1 | 0 | 1
4 | 0 | 1 | 1 | 1 | 0
5 | 1 | 1 | 1 | 1 | 1
6 | 0 | 1 | 1 | 1 | 0
7 | 0 | 1 | 1 | 1 | 1
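
The two support functions can be sketched directly on the toy matrix. The matrix below is transcribed from Table 1 as best it can be read, and the function names `supp_c` and `supp_r` are chosen for this illustration; row and column indices are 1-based, as in the text.

```python
# Toy Boolean matrix transcribed from Table 1 (7 rows, 5 columns).
D = [
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
]


def supp_c(rows):
    """Columns (1-based) holding a 1 in every row of `rows`."""
    return {j for j in range(1, len(D[0]) + 1)
            if all(D[i - 1][j - 1] == 1 for i in rows)}


def supp_r(cols):
    """Rows (1-based) holding a 1 in every column of `cols`."""
    return {i for i in range(1, len(D) + 1)
            if all(D[i - 1][j - 1] == 1 for j in cols)}


print(supp_c({4, 5, 6}))
print(supp_r({2, 3, 4}))
```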


Pattern Language. This paper is concerned with a particular kind of pattern
known as a tile [8], denoted $\tau = (I, J)$ and defined as an ordered pair of a set
of rows $I \subseteq \{1,\dots,m\}$ and a set of columns $J \subseteq \{1,\dots,n\}$. A tile $\tau$ is said to be
contained (or present) in $D$, denoted as $\tau \in D$, iff $D(i,j) = 1$ for all $i \in I$ and
$j \in J$. The set of all tiles present in the dataset is denoted as $T$ and is defined
as: $T = \{(I, J) \mid I \subseteq \{1,\dots,m\} \wedge J \subseteq \{1,\dots,n\} \wedge (I, J) \in D\}$. In Table 1, the tile
$\tau_1 = (\{4,5,6,7\}, \{2,3,4\})$ is present in $D$ ($\tau_1 \in T$), because each of its cells has
a value of 1, but $\tau_2 = (\{1,2\}, \{2,3\})$ is not present ($\tau_2 \notin T$) since $D(1,3) = 0$.


2.2 The Interestingness of a Tile 


In order to assess the quality of a tile $\tau$, we use the framework of subjective
interestingness SI proposed in [6]. We briefly recapitulate the definition of this
measure for tiles, denoted $SI(\tau)$ for a tile $\tau$, and refer the reader to [6] for
more details. $SI(\tau)$ measures the quality of a tile $\tau$ as the ratio of its subjective
information content $IC(\tau)$ and its description length $DL(\tau)$:

$$SI(\tau) = \frac{IC(\tau)}{DL(\tau)}.$$

Tiles with large $SI(\tau)$ thus compress subjective information in a short description.
Before introducing IC and DL, we first describe the background model, an
important component required to define the subjective information content IC.


Background Model. The SI is subjective in the sense that it accounts for the prior
knowledge of the current data miner. A tile $\tau$ is informative for a particular
user if this tile is somehow surprising for her; otherwise, it does not bring new
information. The most natural way of formalizing this is to use a background
distribution representing the data miner’s prior expectations, and to compute the
probability $\Pr(\tau \in D)$ of this tile under this distribution. The smaller $\Pr(\tau \in D)$,
the more information this pattern contains. Concretely, the background model
consists of a value $\Pr(D(i,j) = 1)$ associated to each cell $D(i,j)$ of the dataset,
and denoted $p_{ij}$. More precisely, $p_{ij}$ is the probability that $D(i,j) = 1$ under
the user's prior beliefs. In [6], it is shown how to compute the background model and
derive all the values $p_{ij}$ corresponding to a given set of considered user priors.
Based on this model, the probability of having a tile $\tau = (I, J)$ in $D$ is:

$$\Pr(\tau \in D) = \Pr\Big(\bigwedge_{i \in I,\, j \in J} D(i,j) = 1\Big) = \prod_{i \in I,\, j \in J} p_{ij}.$$


Information Content IC. This measure aims to quantify the amount of information
conveyed to a data miner when she is told about the presence of a tile
in the dataset. It is defined for a tile $\tau = (I, J)$ as follows:

$$IC(\tau) = -\log(\Pr(\tau \in D)) = \sum_{i \in I,\, j \in J} -\log(p_{ij}).$$

Thus, the smaller $\Pr(\tau \in D)$, the higher $IC(\tau)$, and the more informative $\tau$.
Note that for $\tau_1, \tau_2 \in D$: $IC(\tau_1 \cup \tau_2) = IC(\tau_1) + IC(\tau_2) - IC(\tau_1 \cap \tau_2)$.


Description Length DL. This function should quantify how difficult it is for a
user to assimilate the pattern. The description length of a tile $\tau = (I, J)$ should
thus depend on how many rows and columns it refers to: the larger $|I|$ and
$|J|$ are, the larger the description length. Thus, $DL(\tau)$ can be defined as:

$$DL(\tau) = a + b \cdot (|I| + |J|),$$

where $a$ and $b$ are two constants that can be tuned to give more or less importance
to the contributions of $|I|$ and $|J|$ in the description length.
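
A minimal sketch of the three quantities follows. It assumes a hypothetical uniform background model with $p_{ij} = 0.5$ for every cell, the natural logarithm, and $a = b = 1$; none of these values comes from the paper, where the background model is derived from user priors as described in [6].

```python
import math


def information_content(tile, p):
    """IC: sum of -log p(i, j) over all cells of the tile."""
    rows, cols = tile
    return sum(-math.log(p(i, j)) for i in rows for j in cols)


def description_length(tile, a=1.0, b=1.0):
    """DL = a + b * (|I| + |J|); a and b are assumed to be 1 here."""
    rows, cols = tile
    return a + b * (len(rows) + len(cols))


def subjective_interestingness(tile, p, a=1.0, b=1.0):
    """SI = IC / DL."""
    return information_content(tile, p) / description_length(tile, a, b)


def uniform_background(i, j):
    # Hypothetical background model: every cell is 1 with probability 0.5.
    return 0.5


tau_1 = ({4, 5, 6, 7}, {2, 3, 4})
print(subjective_interestingness(tau_1, uniform_background))
```

Under this uniform model, the 12 cells of the tile contribute $\log 2$ each to IC, while DL is $1 + (4 + 3) = 8$, so SI equals $12 \log 2 / 8$.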


2.3 Problem Statement 


Given a Boolean dataset $D$, the goal is to sample a tile $\tau$ from the set $T$ of all the
tiles present in $D$, with a sampling probability $P_S$ proportional to $SI(\tau)$,
that is:

$$P_S(\tau) = \frac{SI(\tau)}{\sum_{\tau' \in T} SI(\tau')}.$$

A naïve approach to sample a tile pattern according to this distribution is
to generate the list $\{\tau_1, \dots, \tau_N\}$ of all the tiles present in $D$, sample $x \in [0,1]$
uniformly at random, and return the tile $\tau_k$ with

$$\frac{\sum_{i=1}^{k-1} SI(\tau_i)}{\sum_{i=1}^{N} SI(\tau_i)} \le x < \frac{\sum_{i=1}^{k} SI(\tau_i)}{\sum_{i=1}^{N} SI(\tau_i)}.$$

However, the goal behind using sampling approaches is to avoid materializing the
pattern space which is generally huge. We want to sample without exhaustively 
enumerating the set of tiles. In [2], an efficient procedure is proposed to directly 
sample patterns according to some measures such as the frequency and the area. 
However, this procedure is limited to only some specific measures. Furthermore, 
it is proposed for pattern languages defined on only the column dimension, for 
example, itemset patterns. In such a language, the rows related to an itemset pattern $F \subseteq \{1, \dots, n\}$ are uniquely identified: they are all the rows containing the itemset, that is, $supp_R(F)$. In our work, we are interested in tiles, which are defined by both column and row indices. In this case, it is not clear how the direct procedure proposed in [2] can be applied.
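
For reference, the naïve enumerate-then-sample approach described at the start of this section can be sketched as follows. The helper name and the toy weights are assumptions for illustration; the list of tiles and their SI values would have to be materialized beforehand, which is exactly the cost that the rest of the paper aims to avoid.

```python
import bisect
import itertools
import random


def sample_proportional(items, weights, rng):
    """Return one item with probability proportional to its (non-negative) weight.

    Draws x uniformly in [0, total) and returns the first item whose
    cumulative weight exceeds x (inverse-CDF sampling).
    """
    cumulative = list(itertools.accumulate(weights))
    x = rng.random() * cumulative[-1]
    k = bisect.bisect_right(cumulative, x)
    return items[min(k, len(items) - 1)]  # guard against rounding at the upper end


# Toy example: two hypothetical tiles with SI values 1 and 3.
rng = random.Random(0)
draws = [sample_proportional(["tau_1", "tau_2"], [1.0, 3.0], rng) for _ in range(10000)]
share_tau_2 = draws.count("tau_2") / len(draws)
```

With these weights, roughly three quarters of the draws should return the second item.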

For more complex pattern languages, a generic procedure based on the Metropolis-Hastings algorithm has been proposed in [9], and illustrated for subgraph patterns with some quality measures. While this approach is generic and can be extended relatively easily to different mining tasks, a major drawback of the Metropolis-Hastings algorithm is that the random walk contains an acceptance test that must be evaluated in each iteration, and the acceptance rate $\alpha$ can be very small, which makes convergence extremely slow in practice. Furthermore, Metropolis-Hastings can be computationally expensive, as the part of the transition graph representing the probability of reaching patterns given the current one has to be materialized.

Interestingly, a very useful MCMC technique is Gibbs sampling, which is a special case of the Metropolis-Hastings algorithm. A significant benefit of this approach is that the acceptance rate $\alpha$ is always equal to 1, i.e., the proposal of each sampling iteration is always accepted. In this work, we use Gibbs sampling to draw patterns with a probability distribution that converges to $P_S$. In what follows, we first present the Gibbs sampling approach generically, and then show how we efficiently exploit it for our problem. Unlike Metropolis-Hastings, the proposed procedure performs a random walk by materializing in each iteration only the currently sampled pattern.


3 Gibbs Sampling 


Suppose we have a random variable $X = (X_1, X_2, \dots, X_l)$ taking values in some domain $Dom$. We want to sample a value $x \in Dom$ following the joint distribution $P(X = x)$. Gibbs sampling is suitable when it is hard to sample directly from $P$ but it is known how to sample just one dimension $x_k$ ($k \in [1, l]$) from the conditional probability $P(X_k = x_k \mid X_1 = x_1, \dots, X_{k-1} = x_{k-1}, X_{k+1} = x_{k+1}, \dots, X_l = x_l)$. The idea of Gibbs sampling is to generate samples by sweeping through each variable (or block of variables) to sample from its conditional distribution with the remaining variables fixed to their current values. Algorithm 1 depicts a generic Gibbs sampler. At the beginning, $x$ is set to its initial values (often values sampled from a prior distribution $q$). Then, the algorithm performs a random walk of $p$ iterations. In each iteration, we sample $x_1 \sim P(X_1 = x_1^{(k)} \mid X_2 = x_2^{(k-1)}, \dots, X_l = x_l^{(k-1)})$ (while fixing the other dimensions), then we follow the same procedure to sample $x_2, \dots$, until $x_l$.
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Algorithm 1: Gibbs sampler 

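Initialize $x^{(0)} \sim q(x)$
for $k \in [1, p]$ do
    draw $x_1^{(k)} \sim P(X_1 = x_1 \mid X_2 = x_2^{(k-1)}, X_3 = x_3^{(k-1)}, \dots, X_l = x_l^{(k-1)})$
    draw $x_2^{(k)} \sim P(X_2 = x_2 \mid X_1 = x_1^{(k)}, X_3 = x_3^{(k-1)}, \dots, X_l = x_l^{(k-1)})$
    ...
    draw $x_l^{(k)} \sim P(X_l = x_l \mid X_1 = x_1^{(k)}, X_2 = x_2^{(k)}, \dots, X_{l-1} = x_{l-1}^{(k)})$
return $x^{(p)}$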
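
The generic sampler of Algorithm 1 can be sketched in a few lines for discrete variables. This is only an illustration under simplifying assumptions: the joint distribution is given as a function returning an unnormalized probability, each conditional is obtained by evaluating that function for every value of the swept variable, and all names are hypothetical.

```python
import random


def gibbs_sampler(joint, domains, x0, p, rng):
    """Run p Gibbs sweeps over discrete variables and return the final state.

    joint   : maps a full state tuple to an unnormalized probability
    domains : list of value lists, one per variable
    x0      : initial state (playing the role of the draw from q)
    """
    x = list(x0)
    for _ in range(p):
        for k, domain in enumerate(domains):
            weights = []
            for value in domain:  # conditional of X_k given all the other variables
                x[k] = value
                weights.append(joint(tuple(x)))
            x[k] = rng.choices(domain, weights=weights)[0]
    return tuple(x)


# Toy joint distribution over two binary variables.
table = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
rng = random.Random(0)
samples = [gibbs_sampler(table.get, [[0, 1], [0, 1]], (0, 0), 5, rng) for _ in range(4000)]
freq_11 = samples.count((1, 1)) / len(samples)
```

On this toy table the empirical frequency of the state (1, 1) should be close to its target probability of 0.4.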


The random walk needs to satisfy some constraints to guarantee that the 
Gibbs sampling procedure converges to the stationary distribution P. In the 
case of a finite number of states (a finite space Dom in which X takes values), 
sufficient conditions for the convergence are irreducibility and aperiodicity: 


Irreducibility. A random walk is irreducible if, for any two states $x, y \in Dom$ s.t. $P(x) > 0$ and $P(y) > 0$, we can get from $x$ to $y$ with a probability $> 0$ in a finite number of steps, i.e., the entire state space is reachable.

Aperiodicity. A random walk is aperiodic if we can return to any state $x \in Dom$ at any time, i.e., revisiting $x$ is not conditioned to some periodicity constraint.


One can also use blocked Gibbs sampling. This consists in grouping many variables together and sampling from their joint distribution conditioned on the remaining variables, rather than sampling each variable $x_i$ individually. Blocked Gibbs sampling can reduce the problem of slow mixing that can be due to the high number of dimensions used to sample from.


4 Gibbs Sampling of Tiles with Respect to SI 


In order to sample a tile $\tau = (I, J)$ with a probability proportional to $SI(\tau)$, we propose to use Gibbs sampling. The simplest solution is to consider a tile $\tau$ as $m + n$ binary random variables $(x_1, \dots, x_m, \dots, x_{m+n})$, each corresponding to a row or a column, and then apply the procedure described in Algorithm 1. In this case, an iteration of Gibbs sampling requires sampling each column and row separately while fixing all the remaining rows and columns. The drawback of this approach is the high number of variables ($m + n$), which may lead to a slow mixing time. In order to reduce the number of variables, we propose to split $\tau = (I, J)$ into only two separate blocks of random variables $\mathbf{I}$ and $\mathbf{J}$; we then directly sample from each block while fixing the value of the other block. This means that an iteration of the random walk contains only two sampling operations instead of $m + n$. We will explain in more detail how this blocked Gibbs sampling approach can be applied, and how to compute the distributions used to directly sample a block of rows or columns.
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Algorithm 2: Gibbs-SI 
Initialize $(I, J)^{(0)} \sim q(x)$
for $k \in [1, p]$ do
    draw $I^{(k)} \sim P(\mathbf{I} = I \mid \mathbf{J} = J^{(k-1)})$
    draw $J^{(k)} \sim P(\mathbf{J} = J \mid \mathbf{I} = I^{(k)})$
return $(I, J)^{(p)}$


Algorithm 2 depicts the main steps of blocked Gibbs sampling for tiles. We start by initializing $(I, J)$ with a distribution $q$ proportional to the area ($|I| \times |J|$), following the approach proposed in [2]. This choice is mainly motivated by its linear sampling time. Then, we need to efficiently sample from $P(\mathbf{I} = I \mid \mathbf{J} = J)$ and $P(\mathbf{J} = J \mid \mathbf{I} = I)$. In the following, we explain how to sample $I$ with $P(\mathbf{I} = I \mid \mathbf{J} = J)$; since the SI is symmetric w.r.t. rows and columns, the same strategy can be used symmetrically to sample a set of columns with $P(\mathbf{J} = J \mid \mathbf{I} = I)$.
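
The alternation of Algorithm 2 can be illustrated on a tiny dataset with the sketch below. It is deliberately not the efficient procedure of the paper: each conditional is sampled by brute-force enumeration of all non-empty subsets of the supporting rows (or columns), which is exponential and only usable for toy data. It also assumes $SI = IC / DL$, and all function names are hypothetical.

```python
import itertools
import math
import random


def support_rows(D, cols):
    """Rows that contain a 1 in every column of cols."""
    return [i for i, row in enumerate(D) if all(row[j] for j in cols)]


def support_cols(D, rows):
    """Columns that contain a 1 in every row of rows."""
    return [j for j in range(len(D[0])) if all(D[i][j] for i in rows)]


def si(p, rows, cols, a, b):
    """Assumed measure: SI = IC / DL."""
    ic = -sum(math.log(p[i][j]) for i in rows for j in cols)
    return ic / (a + b * (len(rows) + len(cols)))


def sample_block(candidates, weight, rng):
    """Brute force: draw a non-empty subset with probability proportional to its weight."""
    subsets = [s for k in range(1, len(candidates) + 1)
               for s in itertools.combinations(candidates, k)]
    return list(rng.choices(subsets, weights=[weight(s) for s in subsets])[0])


def gibbs_si(D, p, rows, cols, iterations, a, b, rng):
    """Alternate between sampling the row block and the column block."""
    for _ in range(iterations):
        rows = sample_block(support_rows(D, cols), lambda I: si(p, I, cols, a, b), rng)
        cols = sample_block(support_cols(D, rows), lambda J: si(p, rows, J, a, b), rng)
    return rows, cols


# Toy dataset and uniform background model (both hypothetical).
D = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
p = [[0.5] * 3 for _ in range(3)]
rows, cols = gibbs_si(D, p, [0], [0], 50, 1.0, 1.0, random.Random(0))
```

Empty blocks are left out of the enumeration because their information content, and hence their SI, is zero. Every state visited by this walk is a tile present in `D`, since rows are drawn from the support of the current columns and vice versa.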


Sampling a Set of Rows $I$ Conditioned to Columns $J$. For a specific $J \subseteq \{1, \dots, n\}$, the number of tiles $(I, J)$ present in the dataset can be huge, and can go up to $2^m$. This means that naively generating all these candidate tiles and then sampling from them is not a solution. Thus, to sample a set of rows $I$ conditioned to a fixed set of columns $J$, we propose an iterative algorithm that builds the sampled $I$ by drawing each $i \in I$ separately, while ensuring that the joint distribution of all the drawings is equal to $P(\mathbf{I} = I \mid \mathbf{J} = J)$. $I$ is built using two variables: $R_1 \subseteq \{1, \dots, m\}$, made of rows that belong to $I$, and $R_2 \subseteq \{1, \dots, m\} \setminus R_1$, which contains candidate rows that can possibly be sampled and added to $R_1$. Initially, we have $R_1 = \emptyset$ and $R_2 = supp_R(J)$. At each step, we take $i \in R_2$, do a random draw to determine whether $i$ is added to $R_1$ or not, and remove it from $R_2$. When $R_2 = \emptyset$, the sampled set of rows $I$ is set equal to $R_1$. To apply this strategy, all we need is to compute $P(i \in \mathbf{I} \mid R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)$, the probability of sampling $i$ considering the current sets $R_1$, $R_2$ and $J$:

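$$
\begin{aligned}
&P(i \in \mathbf{I} \mid R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J) \\
&\quad= \frac{P(R_1 \cup \{i\} \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)}{P(R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)} \\
&\quad= \frac{\sum_{F \subseteq R_2 \setminus \{i\}} SI(R_1 \cup \{i\} \cup F, J)}{\sum_{F \subseteq R_2} SI(R_1 \cup F, J)} \\
&\quad= \frac{\sum_{F \subseteq R_2 \setminus \{i\}} \frac{IC(R_1 \cup \{i\}, J) + IC(F, J)}{a + b \cdot (|R_1| + |F| + 1 + |J|)}}{\sum_{F \subseteq R_2} \frac{IC(R_1, J) + IC(F, J)}{a + b \cdot (|R_1| + |F| + |J|)}} \\
&\quad= \frac{\sum_{k=0}^{|R_2|-1} \frac{1}{a + b \cdot (|R_1| + k + 1 + |J|)} \sum_{F \subseteq R_2 \setminus \{i\},\, |F| = k} \big(IC(R_1 \cup \{i\}, J) + IC(F, J)\big)}{\sum_{k=0}^{|R_2|} \frac{1}{a + b \cdot (|R_1| + k + |J|)} \sum_{F \subseteq R_2,\, |F| = k} \big(IC(R_1, J) + IC(F, J)\big)} \\
&\quad= \frac{\sum_{k=0}^{|R_2|-1} \frac{1}{a + b \cdot (|R_1| + k + 1 + |J|)} \Big(\binom{|R_2|-1}{k} \cdot IC(R_1 \cup \{i\}, J) + \binom{|R_2|-2}{k-1} \cdot IC(R_2 \setminus \{i\}, J)\Big)}{\sum_{k=0}^{|R_2|} \frac{1}{a + b \cdot (|R_1| + k + |J|)} \Big(\binom{|R_2|}{k} \cdot IC(R_1, J) + \binom{|R_2|-1}{k-1} \cdot IC(R_2, J)\Big)} \\
&\quad= \frac{IC(R_1 \cup \{i\}, J) \cdot f(|R_2| - 1, |R_1| + 1) + IC(R_2 \setminus \{i\}, J) \cdot f(|R_2| - 2, |R_1| + 2)}{IC(R_1, J) \cdot f(|R_2|, |R_1|) + IC(R_2, J) \cdot f(|R_2| - 1, |R_1| + 1)},
\end{aligned}
$$

with $f(x, y) = \sum_{k=0}^{x} \frac{\binom{x}{k}}{a + b \cdot (y + k + |J|)}$ (an empty sum, hence 0, when $x < 0$) and the convention $\binom{n}{-1} = 0$. The step that introduces the binomial coefficients uses the fact that each row of a set $S$ belongs to exactly $\binom{|S|-1}{k-1}$ of the subsets of $S$ of size $k$, so that $\sum_{F \subseteq S,\, |F| = k} IC(F, J) = \binom{|S|-1}{k-1} \cdot IC(S, J)$. Shifting the summation index $k$ by one in the terms carrying $\binom{\cdot}{k-1}$ gives the second argument $|R_1| + 2$ in the numerator and $|R_1| + 1$ in the denominator.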
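
The closed form can be checked numerically against a brute-force evaluation of the defining ratio of sums. The sketch below assumes $SI = IC / DL$, the additivity of $IC$ over disjoint row sets, and $f(x, y) = \sum_{k=0}^{x} \binom{x}{k} / (a + b \cdot (y + k + |J|))$; it recomputes $f$ from scratch at every call instead of using an incremental update, and all names are illustrative.

```python
import itertools
import math
import random


def f(x, y, n_cols, a, b):
    """f(x, y) = sum_{k=0}^{x} C(x, k) / (a + b * (y + k + |J|)); 0 when x < 0."""
    return sum(math.comb(x, k) / (a + b * (y + k + n_cols)) for k in range(x + 1))


def prob_add_row(i, R1, R2, ic_row, n_cols, a, b):
    """Closed-form P(i in I | R1 <= I <= R1 | R2, J); ic_row[r] holds IC({r}, J)."""
    ic = lambda rows: sum(ic_row[r] for r in rows)
    x, y = len(R2), len(R1)
    num = (ic(R1 | {i}) * f(x - 1, y + 1, n_cols, a, b)
           + ic(R2 - {i}) * f(x - 2, y + 2, n_cols, a, b))
    den = ic(R1) * f(x, y, n_cols, a, b) + ic(R2) * f(x - 1, y + 1, n_cols, a, b)
    return num / den


def brute_force_prob(i, R1, R2, ic_row, n_cols, a, b):
    """Same probability, obtained by enumerating every subset F explicitly."""
    si = lambda rows: sum(ic_row[r] for r in rows) / (a + b * (len(rows) + n_cols))
    subsets = lambda s: (set(c) for k in range(len(s) + 1)
                         for c in itertools.combinations(sorted(s), k))
    num = sum(si(R1 | {i} | F) for F in subsets(R2 - {i}))
    den = sum(si(R1 | F) for F in subsets(R2))
    return num / den


def sample_rows(candidates, ic_row, n_cols, a, b, rng):
    """Build I row by row: each candidate is kept with its conditional probability."""
    R1, R2 = set(), set(candidates)
    for i in sorted(candidates):
        if rng.random() < prob_add_row(i, R1, R2, ic_row, n_cols, a, b):
            R1.add(i)
        R2.discard(i)
    return R1


# Hypothetical per-row information contents IC({r}, J) for a fixed J with |J| = 2.
ic_row = {0: 1.0, 1: 0.5, 2: 2.0, 3: 0.25}
sampled = sample_rows([1, 2, 3], ic_row, 2, 1.0, 0.5, random.Random(0))
```

When $R_1$ is still empty and only one candidate row remains, the conditional probability equals 1, so the sampled set of rows is never empty.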
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Complexity. Let’s compute the complexity of sampling I with a probability 
$P(\mathbf{I} = I \mid \mathbf{J} = J)$. Before starting the sampling of rows from $R_2$, we first compute the value of $IC(\{i\}, J)$ for each $i \in R_2$ (in $O(n \cdot m)$). This allows us to compute in $O(1)$ the values of $IC$ that appear in $P(i \in \mathbf{I} \mid R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)$, based on the relation $IC(I_1 \cup I_2, J) = IC(I_1, J) + IC(I_2, J)$ for disjoint $I_1, I_2 \subseteq [1, m]$. In addition, sampling each element $i \in R_2$ requires computing the corresponding values of $f(x, y)$. These values are computed once for the first sampled row $i \in R_2$ with a cost of $O(m)$, and then they can be updated directly when sampling the next rows, using the following relation:


$$\frac{1}{a + b \cdot (x + y + |J|)}$$


This means that the overall cost of sampling the whole set of rows $I$ with a probability $P(\mathbf{I} = I \mid \mathbf{J} = J)$ is $O(n \cdot m)$. Following the same approach, sampling $J$ conditioned to $I$ is done in $O(n \cdot m)$. As we have $p$ sampling iterations, the worst-case complexity of the whole Gibbs sampling procedure of a tile $\tau$ is $O(p \cdot n \cdot m)$.


Convergence Guarantee. In order to guarantee the convergence to the stationary distribution proportional to the SI measure, the Gibbs sampling procedure needs to satisfy some constraints. In our case, the sampling space is finite, as the number of tiles is limited to at most $2^{m+n}$. Then, the sampling procedure converges if it satisfies the aperiodicity and the irreducibility constraints. The Gibbs sampling for tiles is indeed aperiodic, as in each iteration it is possible to remain in exactly the same state. We only have to verify whether the irreducibility property is satisfied. We can show that, in some cases, the random walk is reducible; we will show how to make Gibbs sampling irreducible in those cases.


Theorem 1. Let us consider the bipartite graph $G = (U, V, E)$ derived from the dataset $D$, s.t. $U = \{1, \dots, m\}$, $V = \{1, \dots, n\}$, and $E = \{(i, j) \mid i \in [1, m] \wedge j \in [1, n] \wedge D(i, j) = 1\}$. A tile $\tau = (I, J)$ present in $D$ corresponds to a complete bipartite subgraph $G_\tau = (I, J, E_\tau)$ of $G$. If the bipartite graph $G$ is connected, then the Gibbs sampling procedure on tiles of $D$ is irreducible.


Proof. We need to prove that for every pair of tiles $\tau_1 = (I_1, J_1)$, $\tau_2 = (I_2, J_2)$ present in $D$, the Gibbs sampling procedure can go from $\tau_1$ to $\tau_2$. Let $G_{\tau_1}$, $G_{\tau_2}$ be the complete bipartite graphs corresponding to $\tau_1$ and $\tau_2$. As $G$ is connected, there is a path from any vertex of $G_{\tau_1}$ to any vertex of $G_{\tau_2}$. The probability that the sampling procedure walks through one of these paths is not 0, as each step of these paths constitutes a tile present in $D$. After walking on one of these paths, the procedure will find itself on a tile $\tau' \subseteq \tau_2$. Reaching $\tau_2$ from $\tau'$ has non-zero probability in one iteration, by sampling the right rows and then the right columns.


Thus, if the bipartite graph G is connected, the Gibbs sampling procedure 
converges to a stationary distribution. To make the random walk converge when 
G is not connected, we can compute the connected components of G, and then 
apply Gibbs sampling separately in each corresponding subset of the dataset. 
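
The decomposition into connected components mentioned above can be sketched with a breadth-first search over the bipartite graph of rows and columns. The function name and the nested-list representation of $D$ are assumptions for this example; rows and columns without any 1 are skipped because no tile contains them.

```python
from collections import deque


def bipartite_components(D):
    """Return the connected components of the row/column graph of a Boolean matrix.

    Each component is a pair (sorted row indices, sorted column indices).
    """
    m, n = len(D), len(D[0])
    seen_rows, seen_cols = set(), set()
    components = []
    for start in range(m):
        if start in seen_rows or not any(D[start]):
            continue
        rows, cols = {start}, set()
        seen_rows.add(start)
        queue = deque([("row", start)])
        while queue:
            kind, idx = queue.popleft()
            if kind == "row":  # follow the 1-cells of this row to new columns
                for j in range(n):
                    if D[idx][j] and j not in seen_cols:
                        seen_cols.add(j)
                        cols.add(j)
                        queue.append(("col", j))
            else:  # follow the 1-cells of this column to new rows
                for i in range(m):
                    if D[i][idx] and i not in seen_rows:
                        seen_rows.add(i)
                        rows.add(i)
                        queue.append(("row", i))
        components.append((sorted(rows), sorted(cols)))
    return components


# Toy dataset with two components and one empty row.
D = [[1, 1, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 0]]
parts = bipartite_components(D)
```

Each returned pair identifies a sub-matrix of `D` on which the sampler could then be run separately.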
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Table 2. Dataset characteristics. 


Dataset      # rows   # columns   Avg. |row|
mushrooms    8124     120         24
chess        3196     76          38
kdd          843      6159        65.3
[Figure 1: three scatter-plot panels of sampled patterns against SI; x-axis: SI.]


Fig. 1. Distribution of sampled patterns in synthetic data with 10 rows and 10 columns. 


5 Experiments 


We report our experimental study to evaluate the effectiveness of Gibbs-SI. Java source code is made available¹. We consider three datasets whose characteristics are given in Table 2. mushrooms and chess, from the UCI repository², are commonly used for evaluation purposes. kdd contains a set of SIGKDD paper abstracts between 2001 and 2008 downloaded from the ACM website. Each abstract is represented by a row and words correspond to columns, after stop word removal and stemming. For each dataset, the user priors that we represent in the SI background model are the row and column margins. In other words, we consider that the user knows (or is already informed about) the following statistics: $\sum_j D(i, j)$ for every row $i$, and $\sum_i D(i, j)$ for every column $j$.


Empirical Sampling Distribution. First, we want to experimentally evaluate how well the Gibbs sampling distribution matches the desired distribution. We need to run Gibbs-SI on small datasets where the size of $\mathcal{T}$ is not huge. Then, we take a sufficiently large number of samples so that the sampling distribution can be estimated. To this aim, we have synthetically generated a dataset containing 10 rows, 10 columns, and 855 tiles. We run Gibbs-SI with three different numbers of iterations $p$: 1k, 10k, and 100k. In each case, we keep all the visited tiles, and we study their distribution w.r.t. their SI values. Fig. 1 reports the results. For 1k sampled patterns, the proportionality between the number of times a tile is sampled and its SI is not clearly established yet. For higher numbers of sampled patterns, a linear relation between the two axes is evident, especially for the case of 100k sampled patterns, which represents around 100 times the total number of tiles in the dataset. The two tiles with the highest SI are sampled the most, and the number of times a tile is sampled clearly decreases with its SI value.


¹ http://tiny.cc/g5zmgz.
² https://archive.ics.uci.edu/ml/.
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Fig. 2. Distributions of the sampled patterns w.r.t. # rows, # columns and SI. 


Characteristics of Sampled Tiles. To investigate which kind of patterns are sampled by Gibbs-SI, we show in Fig. 2 the distribution of sampled tiles w.r.t. their number of rows, their number of columns, and their SI, for each of the three datasets given in Table 2. For mushrooms and chess, Gibbs-SI is able to return patterns with a diverse number of rows and columns. It samples many more patterns with low SI than patterns with high SI values. In fact, even if we are sampling proportionally to SI, the number of tiles in $\mathcal{T}$ with poor quality is significantly higher than the number of tiles with high quality values. Thus, the probability of sampling one of the low quality patterns is higher than that of sampling one of the few high quality patterns. For kdd, although the number of columns in sampled tiles varies, all the sampled tiles unfortunately cover only one row. In fact, the particularity of this dataset is the existence of some very large transactions (max = 180).


Quality of the Sampled Tiles. In this part of the experiment, we want to study whether the quality of the top sampled tiles is sufficient. As exhaustively mining the best tiles w.r.t. SI is not feasible, we need to find some strategy that identifies high quality tiles. We propose to use LCM [14] to retrieve the closed tiles corresponding to the top 10k frequent closed itemsets. A closed tile $\tau = (I, J)$ is a tile that is present in $D$ and whose $I$ and $J$ cannot be extended anymore. Although closed tiles are not necessarily the ones with the highest SI, we make the hypothesis that at least some of them have high SI values as they maximize the value of the IC function. For each of the three real world datasets, we compare the SI of the top closed tiles identified with LCM and the ones identified with Gibbs-SI. In Table 3, we show the SI of the top-1 tile, and the average SI of the top-10 tiles, for each of LCM and Gibbs-SI.

Unfortunately, the scores of tiles retrieved with LCM are substantially larger than the ones of Gibbs-SI, especially for mushrooms and chess. Importantly, there may exist tiles that are even better than the ones found by LCM. This means that Gibbs-SI fails to identify the top tiles in the dataset. We believe that this is due to the very large number of low quality tiles, which dwarfs the number of high quality tiles. The probability of sampling a high-quality tile is exceedingly small, so an impractically large sample would be needed to identify any.

Table 3. The SI of the top-1 tile, and the average SI of the top-10 tiles, found by LCM and Gibbs-SI in the studied datasets.

                 Mushrooms                   Chess                       KDD
                 Top 1 SI   Avg(top 10 SI)   Top 1 SI   Avg(top 10 SI)   Top 1 SI   Avg(top 10 SI)
Gibbs sampling   0.12       0.11             0.015      0.014            0.54       0.54
LCM              3.89       3.20             0.40       0.40             0.83       0.70


6 Discussion 


Our results show that efficiently sampling from the set of tiles with a sampling probability proportional to the tiles' subjective interestingness is possible. Yet, they also show that if the purpose is to identify some of the most interesting patterns, direct pattern sampling may not be a good strategy. The reason is that the number of tiles with low subjective interestingness is vastly larger than the number of tiles with high subjective interestingness. This imbalance is not sufficiently offset by the relative differences in their interestingness and thus in their sampling probability. As a result, the number of tiles that need to be sampled in order to sample one of the few top interesting ones is of the same order as the total number of tiles.

To mitigate this, one could attempt to sample from alternative distributions 
that attribute an even higher probability to the most interesting patterns, e.g. 
with probabilities proportional to the square or other high powers of the sub- 
jective interestingness. We speculate, however, that the computational cost of 
sampling from such more highly peaked distributions will also be larger, undoing 
the benefit of needing to sample fewer of them. This intuition is supported by 
the fact that direct sampling schemes according to itemset support are compu- 
tationally cheaper than according to the square of their support [2]. 

That said, the use of sampled patterns as features for downstream machine 
learning tasks, even if these samples do not include the most interesting ones, 
may still be effective as an alternative to exhaustive pattern mining. 


7 Conclusions 


Pattern sampling has been proposed as a computationally efficient alternative to 
exhaustive pattern mining. Yet, existing techniques have been limited in terms 
of which interestingness measures they could handle efficiently. 




In this paper, we introduced an approach based on Gibbs sampling, which is 
capable of sampling from the set of tiles proportional to their subjective inter- 
estingness. Although we present this approach for a specific type of pattern 
language and quality measure, we can relatively easily follow the same scheme 
to apply Gibbs sampling for other pattern mining settings. The empirical evaluation demonstrates the effectiveness of the sampler, yet it also reveals a potential weakness inherent to pattern sampling: when interesting patterns are vastly outnumbered by non-interesting ones, a large number of samples may be required, even if the samples are drawn with a probability proportional to the interestingness. Investigating our conjecture that this problem affects all
approaches for sampling interesting patterns (for sensible measures of interest- 
ingness) seems a fruitful avenue for further research. 
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Abstract. A naive implementation of k-means clustering requires com- 
puting for each of the n data points the distance to each of the k cluster 
centers, which can result in fairly slow execution. However, by storing 
distance information obtained by earlier computations as well as informa- 
tion about distances between cluster centers, the triangle inequality can 
be exploited in different ways to reduce the number of needed distance 
computations, e.g. [3-5,7,11]. In this paper I present an improvement of the Exponion method [11] that generally accelerates the computations. Furthermore, by evaluating several methods on a fairly wide range of artificial data sets, I derive a kind of map that shows, for given data set parameters, which method (often) yields the lowest execution times.


Keywords: Exact k-means · Triangle inequality · Exponion


1 Introduction 


The k-means algorithm [9] is, without doubt, the best known and (among) the 
most popular clustering algorithm(s), mainly because of its simplicity. However, 
a naive implementation of the k-means algorithm requires O(nk) distance com- 
putations in each update step, where n is the number of data points and k is the 
number of clusters. This can be a severe obstacle if clustering is to be carried 
out on truly large data sets with hundreds of thousands or even millions of data 
points and hundreds to thousands of clusters, especially in high dimensions. 

Hence, in our “big data” age, considerable effort was spent on trying to 
accelerate the computations, mainly by reducing the number of needed distance 
computations. This led to several very clever approaches, including [3-5,7, 11]. 
These methods exploit that for assigning data points to cluster centers knowing 
actual distances is not essential (in contrast to e.g. fuzzy c-means clustering [2]). 
All one really needs to know is which center is closest. This, however, can some- 
times be determined without actually computing (all) distances. 

A core idea is to maintain, for each data point, bounds on its distance to 
different centers, especially to the closest center. These bounds are updated by 
© The Author(s) 2020 
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exploiting the triangle inequality, and can enable us to ascertain that the center 
that was closest before the most recent update step is still closest. Furthermore, 
by maintaining additional information, tightening these bounds can sometimes 
be done by looking at only a subset of the cluster centers. 

In this paper I present an improvement of one of the most sophisticated of such schemes: the Exponion method [11]. In addition, by comparing my new approach to other methods on several (artificial) data sets with a wide range of number of dimensions and number of clusters, I derive a kind of map that shows, for given data set parameters, which method (often) yields the lowest execution times.


2 k-Means Clustering 


The k-means algorithm is a very simple, yet effective clustering scheme that finds a user-specified number $k$ of clusters in a given data set. This data set is
commonly required to consist of points in a metric space. The algorithm starts 
by choosing an initial set of k cluster centers, which may naively be obtained 
by sampling uniformly at random from the given data points. In the subsequent 
cluster center optimization phase, two steps are executed alternatingly: (1) each 
data point is assigned to the cluster center that is closest to it (that is, closer 
than any other cluster center) and (2) the cluster centers are recomputed as 
the vector means of the data points assigned to them (to enable these mean 
computations, the data points are supposed to live in a metric space). 

Using $\nu_m^t(x)$ to denote the cluster center $m$-th closest to a point $x$ in the data space, this update scheme can be written (for $n$ data points $x_1, \dots, x_n$) as

$$\forall i; 1 \le i \le k: \quad c_i^{t+1} = \frac{\sum_{j=1}^{n} \mathbb{1}\big(\nu_1^t(x_j) = c_i^t\big) \cdot x_j}{\sum_{j=1}^{n} \mathbb{1}\big(\nu_1^t(x_j) = c_i^t\big)},$$

where the indices $t$ and $t + 1$ indicate the update step and the function $\mathbb{1}(\phi)$ yields 1 if $\phi$ is true and 0 otherwise. Here $\nu_1^t(x_j)$ represents the assignment step and the fraction computes the mean of the data points assigned to center $c_i$.

It can be shown that this update scheme must converge, that is, must reach a state in which another execution of the update step does not change the cluster centers anymore [14]. However, there is no guarantee that the obtained result is optimal in the sense that it yields the smallest sum of squared distances between the data points and the cluster centers they are assigned to. Rather, it is very likely that the optimization gets stuck in a local optimum. It has even been shown that k-means clustering is NP-hard for 2-dimensional data [10].

Furthermore, the quality of the obtained result can depend heavily on the choice of the initial centers. A poor choice can lead to inferior results due to a local optimum. However, improvements over naively sampling uniformly at random from the data points are easily found, for example the Maximin method [8] and the k-means++ procedure [1], which has become the de facto standard.
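
The two alternating steps of this section can be sketched as a naive implementation that computes all $n \cdot k$ distances in every iteration. The function names, the tuple representation of points, and the choice to leave the center of an empty cluster in place are assumptions made for this example.

```python
import math


def assign_points(points, centers):
    """Assignment step: index of the closest center for every data point."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]


def update_centers(points, assignment, centers):
    """Update step: each center becomes the mean of the points assigned to it."""
    dim = len(points[0])
    new_centers = []
    for i, center in enumerate(centers):
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                     for d in range(dim)))
        else:
            new_centers.append(center)  # assumption: an empty cluster keeps its center
    return new_centers


def naive_kmeans(points, centers, max_iter=100):
    """Alternate both steps until the centers no longer change."""
    centers = list(centers)
    assignment = []
    for _ in range(max_iter):
        assignment = assign_points(points, centers)
        new_centers = update_centers(points, assignment, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
final_centers, final_assignment = naive_kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
```

On these four points the procedure stops after the second update, with one center in the middle of each pair of points.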
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3 Bounds-Based Exact k-Means Clustering 


Some approaches to accelerate the k-means algorithm rely on approximations, 
which may lead to different results, e.g. [6,12,13]. Here, however, I focus on 
methods to accelerate exact k-means clustering, that is, methods that, starting 
from the same initialization, produce the same result as a naive implementation. 


Fig. 1. Using the triangle inequality to update the distance bounds for a data point $x_j$.


The core idea of these methods is to compute for each update step the dis- 
tance each center moved, that is, the distance between the new and the old 
location of the center. Applying the triangle inequality one can then derive how 
close or how far away an updated center can be from a data point in the worst 
possible case. For this we distinguish between the center closest (before the update) to a data point $x_j$ on the one hand and all other centers on the other.


k Distance Bounds. The first approach along these lines was developed in [5] and maintains one distance bound for each of the $k$ cluster centers.

For the center closest to a data point $x_j$, an upper bound $u_j^t$ on its distance is updated as shown in Fig. 1(a): If we know before the update that the distance between $x_j$ and its closest center $c_{j1}^t = \nu_1^t(x_j)$ is (at most) $u_j^t$, and the update moved the center $c_{j1}^t$ to the new location $c_{j1}^{t+1}$, then the distance $d(x_j, c_{j1}^{t+1})$ between the data point and the new location of this center¹ cannot be greater than $u_j^{t+1} = u_j^t + d(c_{j1}^t, c_{j1}^{t+1})$. This bound is actually reached if before the update the bound was tight and the center $c_{j1}^t$ moves away from the data point $x_j$ on the straight line through $x_j$ and $c_{j1}^t$ (that is, if the triangle is “flat”).

For all other centers, that is, centers that are not closest to the point $x_j$, lower bounds $\ell_{ji}^t$, $i = 2, \dots, k$, are updated as shown in Fig. 1(b): If we know before the update that the distance between $x_j$ and a center $c_{ji}^t = \nu_i^t(x_j)$ is (at least) $\ell_{ji}^t$, and the update moved the center $c_{ji}^t$ to the new location $c_{ji}^{t+1}$, then the distance $d(x_j, c_{ji}^{t+1})$ between the data point and the new location of this center cannot be less than $\ell_{ji}^{t+1} = \ell_{ji}^t - d(c_{ji}^t, c_{ji}^{t+1})$. This bound is actually reached if before the update the bound was tight and the center $c_{ji}^t$ moves towards the data point $x_j$ on the straight line through $x_j$ and $c_{ji}^t$ (“flat” triangle).


¹ Note that it may be $c_{j1}^{t+1} \ne \nu_1^{t+1}(x_j)$ (although equality is not ruled out either), because the update may have changed which cluster center is closest to the data point $x_j$.
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These bounds are easily exploited to avoid distance computations for a data 
point $x_j$: If we find that $u_j^{t+1} < \ell_j^{t+1} = \min_{i=2}^{k} \ell_{ji}^{t+1}$, that is, if the upper bound on the distance to the center that was closest before the update (in step $t$) is less than the smallest lower bound on the distances to any other center, the center that was closest before the update must still be closest after the update (that is, in step $t + 1$). Intuitively: even if the worst possible case happens, namely if the formerly closest center moves straight away from the data point and the other centers move straight towards it, no other center can have been brought closer than the one that was already closest before the update.

And even if this test fails, one first computes the actual distance between the data point $x_j$ and $c_{j1}^{t+1}$. That is, one tightens the bound $u_j^{t+1}$ to the actual distance and then reevaluates the test. If it succeeds now, the center that was closest before the update must still be closest. Only if the test also fails with the tightened bound do the distances between the data point and the remaining cluster centers have to be computed in order to find the closest center and to reinitialize the bounds (all of which are tight after such a computation).

This scheme leads to considerable acceleration, because the cost of computing 
the distances between the new and the old locations of the cluster centers as 
well as the cost of updating the bounds is usually outweighed by the distance 
computations that are saved in those cases in which the test succeeds. 
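
The bound update and the resulting test can be illustrated for a single data point as follows. The function names and the one-dimensional toy configuration are assumptions for this example, which is chosen so that both bounds are tight (the “flat” triangle case).

```python
import math


def update_k_bounds(upper, lowers, moved_closest, moved_others):
    """Loosen the bounds of one data point after the centers have moved.

    upper         : upper bound on the distance to the formerly closest center
    lowers        : lower bounds on the distances to the other centers
    moved_closest : distance the formerly closest center moved
    moved_others  : distances the other centers moved (same order as lowers)
    """
    new_upper = upper + moved_closest
    new_lowers = [low - moved for low, moved in zip(lowers, moved_others)]
    return new_upper, new_lowers


def closest_unchanged(upper, lowers):
    """True if the formerly closest center is guaranteed to be closest still."""
    return upper < min(lowers)


# Toy configuration: the closest center moves away from x, the other one towards it.
x = (0.0, 0.0)
old_closest, new_closest = (1.0, 0.0), (1.5, 0.0)
old_other, new_other = (5.0, 0.0), (4.0, 0.0)

upper, lowers = update_k_bounds(
    math.dist(x, old_closest), [math.dist(x, old_other)],
    math.dist(old_closest, new_closest), [math.dist(old_other, new_other)])
skip_distance_computations = closest_unchanged(upper, lowers)
```

Here the test succeeds, so no distance between the data point and a new center location has to be computed in this step.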


2 Distance Bounds. A disadvantage of the scheme just described is that $k$ bound updates are needed for each data point. In order to reduce this cost, in [7] only two bounds are kept per data point: $u_j^t$ and $\ell_j^t$, that is, all non-closest centers are captured by a single lower bound. This bound is updated according to $\ell_j^{t+1} = \ell_j^t - \max_{i=2}^{k} d(c_{ji}^t, c_{ji}^{t+1})$. Even though this leads to worse lower bounds for the non-closest centers (since they are all treated as if they moved by the maximum of the distances any one of them moved), the fact that only two bounds have to be updated leads to faster execution, at least in many cases.
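
A compact sketch of the two-bounds scheme is given below and compared with a naive implementation started from the same initial centers. It is an illustration under simplifying assumptions, not the implementation of [7]: it requires $k \ge 2$, recomputes the maximum movement of the non-closest centers per point, keeps the center of an empty cluster in place, and all names are hypothetical.

```python
import math
import random


def update_centers(points, assign, centers):
    """Mean of the points assigned to each center (empty clusters keep their center)."""
    new = []
    for i, c in enumerate(centers):
        members = [p for p, a in zip(points, assign) if a == i]
        new.append(tuple(sum(v) / len(members) for v in zip(*members)) if members else c)
    return new


def naive_kmeans(points, centers, max_iter=100):
    """Reference implementation: all n * k distances in every iteration."""
    assign = []
    for _ in range(max_iter):
        assign = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        new = update_centers(points, assign, centers)
        if new == centers:
            break
        centers = new
    return centers, assign


def two_bounds_kmeans(points, centers, max_iter=100):
    """Exact k-means with one upper and one lower bound per data point."""
    assign, upper, lower = [], [], []
    for p in points:  # initial exact assignment: both bounds are tight
        ds = sorted((math.dist(p, c), i) for i, c in enumerate(centers))
        assign.append(ds[0][1]); upper.append(ds[0][0]); lower.append(ds[1][0])
    skipped = 0
    for _ in range(max_iter):
        new = update_centers(points, assign, centers)
        if new == centers:
            break
        moved = [math.dist(c, n) for c, n in zip(centers, new)]
        centers = new
        for j, p in enumerate(points):
            a = assign[j]
            upper[j] += moved[a]  # closest center may have moved away
            lower[j] -= max(m for i, m in enumerate(moved) if i != a)
            if upper[j] < lower[j]:
                skipped += 1
                continue
            upper[j] = math.dist(p, centers[a])  # tighten, then test again
            if upper[j] < lower[j]:
                skipped += 1
                continue
            ds = sorted((math.dist(p, c), i) for i, c in enumerate(centers))
            assign[j], upper[j], lower[j] = ds[0][1], ds[0][0], ds[1][0]
    return centers, assign, skipped


# Three well-separated synthetic blobs (seeded for reproducibility).
rng = random.Random(1)
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (8, 0), (0, 8)] for _ in range(50)]
init = [points[0], points[50], points[100]]
ref_centers, ref_assign = naive_kmeans(points, init)
fast_centers, fast_assign, skipped = two_bounds_kmeans(points, init)
```

Both runs start from the same initialization and are expected to end in the same clustering; the counter `skipped` records how often a full recomputation of the distances to all centers was avoided for a data point.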


YinYang Algorithm. Instead of having either one distance bound for each center (k bounds) or capturing all non-closest centers by a single bound (2 bounds),
one may consider a hybrid approach that maintains lower bounds for subsets of 
the non-closest centers. This improves the quality of bounds over the 2 bounds 
approach, because bounds are updated only by the maximum distance a center 
in the corresponding group moved (instead of the global maximum). On the 
other hand, (considerably) fewer than k bounds have to be updated. 

This is the idea of the YinYang algorithm [4], which forms the groups of 
centers by clustering the initial centers with k-means clustering. The number of 
groups is chosen as k/10 in [4], but other factors may be tried. The groups found 
initially are maintained, that is, there is no re-clustering after an update. 

However, apart from fewer bounds (compared to k bounds) and better bounds (compared to 2 bounds), grouping the centers has yet another advantage: If the bounds test fails, even with a tightened bound $u_j^{t+1}$, the groups and their bounds may be used to limit the centers for which a distance recomputation is needed. Because if the test succeeds for some group, one can infer that the closest center cannot be in that group. Only centers in groups for which the group-specific test fails need to be considered for recomputation.

Fig. 2. If $2u_j^{t+1} < d(c_{j1}^{t+1}, \nu_2^{t+1}(c_{j1}^{t+1}))$, then the center $c_{j1}^{t+1}$ must still be closest to the data point $x_j$, due to the triangle inequality.

[Figure 3 labels: ring/(hyper-)annulus searched for the two centers closest to $x_j$; $\delta_j = d(c_{j1}^{t+1}, \nu_2(c_{j1}^{t+1}))$.]

Fig. 3. Annular algorithm [3]: If even after the upper bound $u_j$ for the distance from data point $x_j$ to its (updated) formerly closest center $c_{j1}^{t+1}$ has been made tight, the lower bound $\ell_j$ for distances to other centers is still lower, it is necessary to recompute the two closest centers. Exploiting information about the distance between $c_{j1}^{t+1}$ and another center $\nu_2(c_{j1}^{t+1})$ closest to it, these two centers are searched in a (hyper-)annulus around the origin (dot in the bottom left corner) with $c_{j1}^{t+1}$ in the middle and thickness $2\theta_j$, where $\theta_j = 2u_j + \delta_j$ and $\delta_j = d(c_{j1}^{t+1}, \nu_2(c_{j1}^{t+1}))$. (Color figure online)


Cluster to Cluster Distances. The described bounds test can be improved
by not only computing the distance each center moved, but also the distances
between (updated) centers, to find for each center another center that is closest to
it [5]. With my notation I can denote such a center as $\nu_2^{t+1}(c_i^{t+1})$, that is, the center
that is second closest to the point $c_i^{t+1}$ (second closest, because
$\nu_1^{t+1}(c_i^{t+1}) = c_i^{t+1}$: a center is certainly the center closest to itself).
Knowing the distances $d(c_i^{t+1}, \nu_2^{t+1}(c_i^{t+1}))$, one can test whether
$2u_j^{t+1} < d(c_{j1}^{t\circ}, \nu_2^{t+1}(c_{j1}^{t\circ}))$. If this is the case, the center that
was closest to the data point $x_j$ before the update must still be closest after, as
is illustrated in Fig. 2 for the worst possible case (namely $x_j$, $c_{j1}^{t\circ}$ and
$\nu_2^{t+1}(c_{j1}^{t\circ})$ lie on a straight line with $c_{j1}^{t\circ}$ and $\nu_2^{t+1}(c_{j1}^{t\circ})$
on opposite sides of $x_j$).

Note that this second test can be used with k as well as with 2 bounds.
However, it should also be noted that, although it can lead to an acceleration,
if used in isolation it may also make an algorithm slower, because of the $O(k^2)$
distance computations needed to find the k distances $d(c_i^{t+1}, \nu_2^{t+1}(c_i^{t+1}))$.
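
As an illustration of the cluster to cluster test, the following sketch computes, for every center, the distance to its nearest other center and then applies the test to one data point. The function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def nearest_other_center_dist(centers):
    """For each center, the distance to the closest *other* center.

    This needs O(k^2) distance computations for k centers, which is the
    overhead mentioned in the text.
    """
    k = len(centers)
    return [min(math.dist(centers[i], centers[h]) for h in range(k) if h != i)
            for i in range(k)]

def still_closest(u_j, a_j, sep):
    """True if the formerly closest center a_j is certainly still closest.

    u_j is an upper bound on the distance from the data point to center a_j,
    sep[a_j] the distance from a_j to its nearest other center. By the
    triangle inequality every other center is at least sep[a_j] - u_j away
    from the data point, which exceeds u_j whenever 2 * u_j < sep[a_j].
    """
    return 2.0 * u_j < sep[a_j]
```

The list returned by `nearest_other_center_dist` is computed once per update step and shared by all data points.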


Annular Algorithm. With the YinYang algorithm an idea appeared on the
scene that is at the focus of all following methods: try to limit the centers that
need to be considered in the recomputations if the tests fail even with a tightened
bound $u_j^{t+1}$. Especially if one uses the 2 bounds approach, significant gains may
be obtained: all we need to achieve in this case is to find $c_{j1}^{t+1} = \nu_1^{t+1}(x_j)$ and
$c_{j2}^{t+1} = \nu_2^{t+1}(x_j)$, that is, the two centers closest to $x_j$, because these are all that
is needed for the assignment step as well as for the (tight) bounds $u_j^{t+1}$ and $\ell_j^{t+1}$.

One such approach is the Annular algorithm [3]. For its description, as generally
in the following, I drop the time step indices t + 1 in order to simplify
the notation. The Annular algorithm relies on the following idea: if the tests
described above fail with a tightened bound $u_j$, we cannot infer that $c_{j1}^{\circ}$ is still
the center closest to $x_j$. But we know that the closest center must lie in a (hyper-)
ball with radius $u_j$ around $x_j$ (darkest circle in Fig. 3). Any center outside this
(hyper-)ball cannot be closest to $x_j$, because $c_{j1}^{\circ}$ is closer. Furthermore, if we
know the distance to another center closest to $c_{j1}^{\circ}$, that is, $\nu_2(c_{j1}^{\circ})$, we know that
even in the worst possible case (which is depicted in Fig. 3: $x_j$, $c_{j1}^{\circ}$ and $\nu_2(c_{j1}^{\circ})$
lie on a straight line), the two closest centers must lie in a (hyper-)ball with
radius $u_j + \delta_j$ around $x_j$, where $\delta_j = d(c_{j1}^{\circ}, \nu_2(c_{j1}^{\circ}))$ (medium circle in Fig. 3),
because we already know two centers that are this close, namely $c_{j1}^{\circ}$ and $\nu_2(c_{j1}^{\circ})$.
Therefore, if we know the distances of the centers from the origin, we can easily
restrict the recomputations to those centers that lie in a (hyper-)annulus (hence
the name of this algorithm) around the origin with $c_{j1}^{\circ}$ in the middle and thickness
$2\theta_j$, where $\theta_j = 2u_j + \delta_j$ with $\delta_j = d(c_{j1}^{\circ}, \nu_2(c_{j1}^{\circ}))$ (see Fig. 3, light gray ring
section, origin in the bottom left corner; note that the green line is perpendicular
to the red/blue lines only by accident/for drawing convenience).
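
The annulus restriction can be sketched with a list of centers sorted by their distance from the origin and two binary searches. This is an illustrative sketch under the assumptions stated in the comments, not the code of [3]; the function names are hypothetical.

```python
import bisect
import math

def sort_by_norm(centers):
    """Pairs (distance from the origin, center index), sorted by distance."""
    return sorted((math.hypot(*c), i) for i, c in enumerate(centers))

def annulus_candidates(by_norm, centers, a, u, delta):
    """Indices of the centers inside the annulus searched for one data point.

    a     : index of the (updated) formerly closest center
    u     : tight upper bound, i.e. the distance from the point to center a
    delta : distance from center a to its nearest other center
    A center can be among the two closest only if its norm differs from
    the norm of center a by at most theta = 2u + delta.
    """
    theta = 2.0 * u + delta
    r = math.hypot(*centers[a])
    keys = [norm for norm, _ in by_norm]
    lo = bisect.bisect_left(keys, r - theta)
    hi = bisect.bisect_right(keys, r + theta)
    return [i for _, i in by_norm[lo:hi]]
```

Only the returned candidates need an exact distance computation. The sorted list is rebuilt once per update step, since the centers move.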


Exponion Algorithm. The Exponion algorithm [11] improves over the Annular
algorithm by switching from annuli around the origin to (hyper-)balls around
the (updated) formerly closest center $c_{j1}^{\circ}$. Again we know that the center closest
to $x_j$ must lie in a (hyper-)ball with radius $u_j$ around $x_j$ (darkest circle in Fig. 4)
and that the two closest centers must lie in a (hyper-)ball with radius $u_j + \delta_j$
around $x_j$, where $\delta_j = d(c_{j1}^{\circ}, \nu_2(c_{j1}^{\circ}))$ (medium circle in Fig. 4). Therefore, if
we know the pairwise distances between the (updated) centers, we can easily
restrict the recomputations to those centers that lie in the (hyper-)ball with
radius $r_j = 2u_j + \delta_j$ around $c_{j1}^{\circ}$ (lightest circle in Fig. 4).

The Exponion algorithm also relies on a scheme with which it is avoided
having to sort, for each cluster center, the list of the other centers by their
distance. For this, concentric annuli, one set centered at each center, are created,
with each annulus further out containing twice as many centers as the preceding
one. Clearly this creates an onion-like structure, with an exponentially increasing
number of centers in each layer (hence the name of the algorithm).


[Figure 4: circle/(hyper-)ball searched for the two centers closest to $x_j$]

Fig. 4. Exponion algorithm [11]: If even after the upper bound $u_j$ for the distance from
a data point $x_j$ to its (updated) formerly closest center $c_{j1}^{\circ}$ has been made tight, the
lower bound $\ell_j$ for distances to other centers is still lower, it is necessary to recompute
the two closest centers. Exploiting information about the distance between $c_{j1}^{\circ}$ and
another center $\nu_2(c_{j1}^{\circ})$ closest to it, these two centers are searched in a (hyper-)sphere
around center $c_{j1}^{\circ}$ with radius $r_j = 2u_j + \delta_j$ where $\delta_j = d(c_{j1}^{\circ}, \nu_2(c_{j1}^{\circ}))$. (Color figure
online)

However, avoiding the sorting comes at a price, namely that more centers may 
have to be checked (although at most twice as many [11]) for finding the two 
closest centers and thus additional distance computations ensue. In my imple- 
mentation I avoided this complication and simply relied on sorting the distances, 
since the gains achievable by concentric annuli over sorting are somewhat unclear 
(in [11] no comparisons of sorting versus concentric annuli are provided). 
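
The sorting-based variant of the Exponion restriction can be sketched as follows: each center keeps the other centers sorted by distance, and a binary search cuts this list at the radius $2u_j + \delta_j$. The names are hypothetical and the sketch is not taken from the actual implementation.

```python
import bisect
import math

def sorted_center_distances(centers):
    """For each center, the other centers as (distance, index), ascending."""
    return [sorted((math.dist(c, o), h) for h, o in enumerate(centers) if h != i)
            for i, c in enumerate(centers)]

def ball_candidates(table, a, u):
    """Other centers inside the ball of radius 2u + delta around center a.

    a is the (updated) formerly closest center, u the tight upper bound
    (distance from the data point to center a), and delta the distance
    from a to its nearest other center, i.e. the first entry of table[a].
    """
    row = table[a]
    delta = row[0][0]
    radius = 2.0 * u + delta
    keys = [d for d, _ in row]
    return [h for _, h in row[:bisect.bisect_right(keys, radius)]]
```

Building the table costs O(k^2 log k) per update step, which is the price of sorting that the concentric annuli of [11] are meant to avoid.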


Shallot Algorithm. The Shallot algorithm is the main contribution of this
paper. It starts with the same considerations as the Exponion algorithm, but
adds two improvements. In the first place, not only the closest center $c_{j1}$ and
the two bounds $u_j$ and $\ell_j$ are maintained for each data point (as for Exponion),
but also the second closest center $c_{j2}$. This comes at practically no cost (apart
from having to store an additional integer per data point), because the second
closest center has to be determined anyway in order to set the bound $\ell_j$.

If a recomputation is necessary, because the tests fail even for a tightened $u_j$,
it is not automatically assumed that $c_{j1}^{\circ}$ is the best center z for a (hyper-)ball
to search. As it is plausible that the formerly second closest center $c_{j2}^{\circ}$ may now
be closer to $x_j$ than $c_{j1}^{\circ}$, the center $c_{j2}^{\circ}$ is processed first among the centers $c_{ji}^{\circ}$,
i = 2,...,k. If it turns out that it is actually closer to $x_j$ than $c_{j1}^{\circ}$, then $c_{j2}^{\circ}$ is
chosen as the center z of the (hyper-)ball to check. In this case the (hyper-)ball
will be smaller (since we found that $d(x_j, c_{j2}^{\circ}) < d(x_j, c_{j1}^{\circ})$). For the following,
let p denote the other (updated) center that was not chosen as the center z.

The second improvement may be understood best by viewing the chosen
center z of the (hyper-)ball as the initial candidate $c_{j1}^{*}$ for the closest center in
step t + 1. Hence we initialize $u_j = d(x_j, z)$. For the initial candidate $c_{j2}^{*}$ for the
second closest center in step t + 1 we have two choices, namely p and $\nu_2(z)$. We
choose $c_{j2}^{*} = p$ if $u_j + d(x_j, p) < 2u_j + \delta_j$ and $c_{j2}^{*} = \nu_2(z)$ otherwise, and initialize
$\ell_j = d(x_j, p)$ or $\ell_j = u_j + \delta_j$ accordingly, thus minimizing the radius, which
then can be written, regardless of the choice taken, as $r_j = u_j + \ell_j$.

While traversing the centers in the constructed (hyper-)ball, better candidates
may be obtained. If this happens, the radius of the (hyper-)ball may be
reduced, thus potentially reducing the number of centers to be processed. This
idea is illustrated in Fig. 5. Let $u_j^{\circ}$ be the initial value of $u_j$ when the (hyper-)
ball center was chosen, but before the search is started, that is $u_j^{\circ} = d(x_j, z)$.
If a new closest center (candidate) $c_{j1}^{*}$ is found (see Fig. 5(a)), we can update
$u_j = d(x_j, c_{j1}^{*})$ and $\ell_j = d(x_j, c_{j2}^{*}) = u_j^{\circ}$. Hence we can shrink the radius to
$r_j = 2u_j^{\circ} = u_j^{\circ} + \ell_j$. If then an even closer center is found (see Fig. 5(b)), the
radius may be shrunk further as $u_j$ and $\ell_j$ are updated again. As should be clear
from these examples, the radius is always $r_j = u_j^{\circ} + \ell_j$.
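
The shrinking search can be sketched as follows. This is a simplified illustration rather than the actual C or Python implementation: the function name, the argument layout and the use of a pre-sorted distance table are assumptions. Here `ell` is an upper bound on the distance from the data point to its second closest center, so that every center farther than `u0 + ell` from the ball center z can be skipped.

```python
import math

def shallot_search(x, centers, order, a, b):
    """Return the indices of the two centers closest to the point x.

    order[i] : the other centers as (distance from center i, index), ascending
    a, b     : indices of the formerly closest and second closest center
    """
    da, db = math.dist(x, centers[a]), math.dist(x, centers[b])
    # Ball center z: the closer one of the two remembered centers; p: the other.
    (u0, z), (dp, p) = sorted([(da, a), (db, b)])
    best1, best2 = (u0, z), (dp, p)
    delta = order[z][0][0]          # distance from z to its nearest other center
    ell = min(dp, u0 + delta)       # second closest center is at most this far
    for dist_zc, c in order[z]:
        if dist_zc > u0 + ell:      # outside the (possibly shrunk) ball: stop
            break
        if c == p:                  # distance to p is already known
            continue
        d = math.dist(x, centers[c])
        if d < best1[0]:
            best1, best2 = (d, c), best1
        elif d < best2[0]:
            best2 = (d, c)
        ell = min(ell, best2[0])    # a better second candidate shrinks the ball
    return best1[1], best2[1]
```

Because `ell` only decreases during the traversal and the centers are visited in order of increasing distance from z, the loop can stop at the first center outside the current radius.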


[Figure 5, panels (a) and (b): circle/(hyper-)ball searched for the two centers
closest to $x_j$]

Fig. 5. Shallot algorithm: If a center closer to the data point than the two currently
closest centers is found, the radius of the (hyper-)ball to be searched can be shrunk.


A shallot is a type of onion, smaller than, for example, a bulb onion. I chose 
this name to indicate that the (hyper-)ball that is searched for the two closest 
centers tends to be smaller than for the Exponion algorithm. The reference to an 
onion may appear misguided, because I rely on sorting the list of other centers 
by their distance for each cluster center, rather than using concentric annuli. 
However, an onion reference may also be justified by the fact that my algorithm 
may shrink the (hyper-)ball radius during the traversal of centers in the (hyper-) 
ball, as this also creates a layered structure of (hyper-)balls.


4 Experiments


In order to evaluate the performance of the different exact k-means algorithms 
I generated a large number of artificial data sets. Standard benchmark data sets 
proved to be too small to measure performance differences reliably and would also 
not have permitted drawing “performance maps” (see below). I fixed the number 
of data points in these data sets at n = 100 000. Anything smaller renders the 
time measurements too unreliable, anything larger requires an unpleasantly long 
time to run all benchmarks. Thus I varied only the dimensionality m of the 
data space, namely as $m \in \{2, 3, 4, 5, 6, 8, 10, 15, 20, 25, 30, 35, 40, 45, 50\}$, and
the number k of clusters, from 20 to 300 in steps of 20. For each parameter 
combination I generated 10 data sets, with clusters that are (roughly, due to 
random deviations) equally populated with data points and that may vary in 
size by a factor of at most ten per dimension. All clusters were modeled as 
isotropic normal (or Gaussian) distributions. Each data set was then processed 
10 times with different initializations. All optimization algorithms started from 
the same initializations, thus making the comparison as fair as possible. 

The clustering program is written in C (however, there is also a Python ver- 
sion, see the link to the source code below). All implementations of the different 
algorithms are entirely my own and use the same code to read the data and to 
write the clustering results. This adds to the fairness of the comparison, as in 
this way any differences in execution time can only result from differences of 
the actual algorithms. The test system was an Intel Core 2 Quad Q9650@3GHz
with 8 GB of RAM running Ubuntu Linux 18.04 64bit. 


[Figure 6: map over the number of dimensions (horizontal axis: 2, 3, 4, 5, 6, 8,
10, 15, 20, 25, 30, 35, 40, 45, 50) and the number of clusters (vertical axis);
legend: Shallot (s), Exponion (o), YinYang + cluster to cluster (y)]

Fig. 6. Map of the algorithms that produced the best execution times over number of
dimensions (horizontal) and number of clusters (vertical), showing fairly clear regions
of algorithm superiority. Enjoyably, the Shallot algorithm that was developed in this
paper yields the best results for the largest number of parameter combinations.


Fig. 7. Relative comparison between the Shallot algorithm and the Exponion algorithm.
The left diagram refers to the number of distance computations, the right diagram
to execution time. Blue means that Shallot is better, red that Exponion is better.
(Color figure online)


The results of these experiments are visualized in Figs. 6, 7 and 8. Figure 6
shows on a grid spanned by the number of dimensions (horizontal axis) and the 
number of clusters inducted into the data set (vertical axis) which algorithm 
performed best (in terms of execution time) for each combination. Clearly, the 
Shallot algorithm wins most parameter combinations. Only for larger numbers 
of dimensions and larger numbers of clusters the YinYang algorithm is superior. 

In order to get deeper insights, Fig. 7 shows on the same grid a comparison
of the number of distance computations (left) and the execution times (right)
of the Shallot algorithm and the Exponion algorithm. The relative performance
is color-coded: saturated blue means that the Shallot algorithm needed only
half the distance computations or half the execution time of the Exponion algorithm,
saturated red means that it needed 1.5 times the distance computations
or execution time compared to the Exponion algorithm.


[Figure 8: two maps over the number of dimensions (horizontal axis: 2, 3, 4, 5, 6, 8,
10, 15, 20, 25, 30, 35, 40, 45, 50) and the number of clusters (vertical axis)]

Fig. 8. Variation of the execution times over number of dimensions (horizontal) and
number of clusters (vertical). The left diagram refers to the Shallot algorithm, the right
diagram to the Exponion algorithm. The larger variation for fewer clusters and fewer
dimensions may explain the speckled look of Figs. 6 and 7.


Fig. 9. Relative comparison between the Shallot algorithm and the YinYang algorithm
using the cluster to cluster distance test (pure YinYang is very similar, though). The left
diagram refers to the number of distance computations, the right diagram to execution
time. Blue means that Shallot is better, red that YinYang is better. (Color figure
online)

With respect to distance computations there is no question who is the winner: the
Shallot algorithm wins all parameter combinations, some with a considerable
margin. With respect to execution times, there is also a clear region towards more
dimensions and more clusters, but for fewer clusters and fewer dimensions the diagram
looks a bit speckled. This is a somewhat strange result, as a smaller number of 
distance computations should lead to lower execution times, because the effort 
spent on organizing the search, which is also carried out in exactly the same 
situations, is hardly different between the Shallot and the Exponion algorithm. 

The reason for this speckled look could be that the benchmarks were carried 
out with heavy parallelization (in order to minimize the total time), which may 
have distorted the measurements. As a test of this hypothesis, Fig. 8 shows the 
standard deviation of the execution times relative to their mean. White means 
no variation, fully saturated blue indicates a standard deviation half as large as 
the mean value. The left diagram refers to the Shallot, the right diagram to the 
Exponion algorithm. Clearly, for a smaller number of dimensions and especially 
for a smaller number of clusters the execution times vary more (this may be, 
at least in part, due to the generally lower execution times for these parameter 
combinations). It is plausible to assume that this variability is the explanation 
for the speckled look of the diagrams in Fig. 6 and in Fig. 7 on the right.

Finally, Fig. 9 shows, again on the same grid, a comparison of the number
of distance computations (left) and the execution times (right) of the Shallot
algorithm and the YinYang algorithm (using the test based on cluster to cluster
distances, although a pure YinYang algorithm performs very similarly). The
relative performance is color-coded in the same way as in Fig. 7. Clearly, the
smaller number of distance computations explains why the YinYang algorithm
is superior for more clusters and more dimensions.

The reason is likely that grouping the centers leads to better bounds. This 
hypothesis is confirmed by the fact that the Elkan algorithm (k distance bounds) 
always needs the fewest distance computations (not shown as a grid) and loses 
on execution time only due to having to update so many distance bounds. 


5 Conclusion 


In this paper I introduced the Shallot algorithm, which adds two improvements 
to the Exponion algorithm [11], both of which can potentially shrink the (hyper-) 
ball that has to be searched for the two closest centers if recomputation becomes 
necessary. This leads to a measurable, sometimes even fairly large speedup com- 
pared to the Exponion algorithm due to fewer distance computations. How- 
ever, for high-dimensional data and large numbers of clusters the YinYang algo- 
rithm [4] (with or without the cluster to cluster distance test) is superior to both 
algorithms. Yet, since clustering in high dimensions is problematic anyway due 
to the curse of dimensionality, it may be claimed reasonably confidently that the 
Shallot algorithm is the best choice for standard clustering tasks. 


Software. My implementation of the described methods (C and Python), with
which I conducted the experiments, can be obtained under the MIT License at
http://www.borgelt.net/cluster.html.


Complete Results. A table with the complete experimental results I obtained
can be retrieved as a simple text table at
http://www.borgelt.net/docs/clsbench.txt.


More maps comparing the performance of the algorithms can be found at
http://www.borgelt.net/docs/clsbench.pdf.


Abstract. The emergence of specialized optimization hardware such as
CMOS annealers and adiabatic quantum computers carries the promise 
of solving hard combinatorial optimization problems more efficiently in 
hardware. Recent work has focused on formulating different combina- 
torial optimization problems as Ising models, the core mathematical 
abstraction used by a large number of these hardware platforms, and 
evaluating the performance of these models when solved on specialized 
hardware. An interesting area of application is data mining, where com- 
binatorial optimization problems underlie many core tasks. In this work, 
we focus on consensus clustering (clustering aggregation), an important 
combinatorial problem that has received much attention over the last two 
decades. We present two Ising models for consensus clustering and evalu- 
ate them using the Fujitsu Digital Annealer, a quantum-inspired CMOS 
annealer. Our empirical evaluation shows that our approach outperforms 
existing techniques and is a promising direction for future research. 


1 Introduction 


The increasingly challenging task of scaling the traditional Central Processing
Unit (CPU) has led to the exploration of new computational platforms such
as quantum computers, CMOS annealers, neuromorphic computers, and so on
(see [3] for a detailed exposition). Although their physical implementations differ
significantly, adiabatic quantum computers, CMOS annealers, memristive
circuits, and optical parametric oscillators all share Ising models as their core
mathematical abstraction [3]. This has led to a growing interest in the formulation
of computational problems as Ising models and in the empirical evaluation
of these models on such novel computational platforms. This body of literature
includes clustering and community detection [14,19,23], graph partitioning [26],
and many NP-Complete problems such as covering, packing, and coloring [17].

Consensus clustering is the problem of combining multiple ‘base clusterings’
of the same set of data points into a single consolidated clustering [9]. Consensus
clustering is used to generate robust, stable, and more accurate clustering
results compared to a single clustering approach [9]. The problem of consensus
clustering has received significant attention over the last two decades [9], and
was previously considered under different names (clustering aggregation, cluster
ensembles, clustering combination) [10]. It has applications in different fields
including data mining, pattern recognition, and bioinformatics [10], and a number
of algorithmic approaches have been used to solve this problem. Consensus
clustering is, in essence, a combinatorial optimization problem [28], and different
instances of the problem have been proven to be NP-hard (e.g., [6,25]).

In this work, we investigate the use of special purpose hardware to solve the 
problem of consensus clustering. To this end, we formulate the problem of con- 
sensus clustering using Ising models and evaluate our approach on a specialized 
CMOS annealer. We make the following contributions: 


1. We present and study two Ising models for consensus clustering that can be 
solved on a variety of special purpose hardware platforms. 

2. We demonstrate how our models are embedded on the Fujitsu Digital 
Annealer (DA), a quantum-inspired specialized CMOS hardware. 

3. We present an empirical evaluation based on seven benchmark datasets and 
show our approach outperforms existing techniques for consensus clustering. 


2 Background 


2.1 Problem Definition 


Let $X = \{x_1, \dots, x_n\}$ be a set of n data points. A clustering of X is a process that
partitions X into subsets, referred to as clusters, that together cover X. A clustering
is represented by the mapping $\pi : X \to \{1, \dots, k_\pi\}$ where $k_\pi$ is the number
of clusters produced by clustering $\pi$. Given X and a set $\Pi = \{\pi_1, \dots, \pi_m\}$ of
m clusterings of the points in X, the Consensus Clustering Problem is to find
a new clustering, $\pi^*$, of the data X that best summarizes the set of clusterings
$\Pi$. The new clustering $\pi^*$ is referred to as the consensus clustering.

Due to the ambiguity in the definition of an optimal consensus clustering, sev- 
eral approaches have been proposed to measure the solution quality of consensus 
clustering algorithms [9]. In this work, we focus on the approach of determin- 
ing a consensus clustering that agrees the most with the original clusterings. As 
an objective measure to determine this agreement, we use the mean Adjusted 
Rand Index (ARI) metric (Eq. 14). However, we also consider clustering quality 
measured by mean Silhouette Coefficient [22] and clustering accuracy based on 
true labels. In Sect. 4 these evaluation criteria are discussed in more detail.


2.2 Existing Criteria and Methods 


Various criteria or objectives have been proposed for the Consensus Clustering
Problem. In this work we mainly focus on two well-studied criteria, one based on
the pairwise similarity of the data points, and the other based on the different
assignments of the base clusterings. Other well-known criteria and objectives
for the Consensus Clustering Problem can be found in the excellent surveys of
[9,27], with most defining NP-Hard optimization problems.

Pairwise Similarity Approaches: In this approach, a similarity matrix S is constructed
such that each entry in S represents the fraction of clusterings in which
two data points belong to the same cluster [20]. In particular,

$$S_{uv} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}(\pi_i(u) = \pi_i(v)), \tag{1}$$

with $\mathbb{1}$ being the indicator function. The value $S_{uv}$ lies between 0 and 1, and is
equal to 1 if all the base clusterings assign points u and v to the same cluster.
Once the pairwise similarity matrix is constructed, one can use any similarity-based
clustering algorithm on S to find a consensus clustering with a fixed number
of clusters, K. For example, [16] proposed to find a consensus clustering $\pi^*$
with exactly K clusters that minimizes the within-cluster dissimilarity:

$$\min \sum_{\substack{u,v \in X: \\ \pi^*(u) = \pi^*(v)}} (1 - S_{uv}). \tag{2}$$
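
To make Eqs. (1) and (2) concrete, the following sketch computes the similarity matrix from a list of base clusterings and evaluates the within-cluster dissimilarity of a candidate consensus clustering. The representation (each clustering as a list of cluster labels indexed by data point) and the function names are assumptions made for this illustration.

```python
def similarity_matrix(clusterings):
    """S[u][v]: fraction of base clusterings that put u and v together (Eq. 1)."""
    m = len(clusterings)
    n = len(clusterings[0])
    return [[sum(pi[u] == pi[v] for pi in clusterings) / m for v in range(n)]
            for u in range(n)]

def within_cluster_dissimilarity(S, consensus):
    """Objective of Eq. (2) for a candidate consensus clustering.

    The sum runs over ordered pairs (u, v), so every unordered pair is
    counted twice; the terms with u == v vanish because S[u][u] == 1.
    """
    n = len(consensus)
    return sum(1 - S[u][v]
               for u in range(n) for v in range(n)
               if consensus[u] == consensus[v])
```

For the two base clusterings `[0, 0, 1]` and `[0, 1, 1]`, points 0 and 1 are together in one of two clusterings, so `S[0][1]` is 0.5, while points 0 and 2 are never together, so `S[0][2]` is 0.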


Partition Difference Approaches: An alternative formulation is based on the
different assignments between clusterings. Consider two data points $u, v \in X$,
and two clusterings $\pi_i, \pi_j \in \Pi$. The following binary indicator tests if $\pi_i$ and $\pi_j$
disagree on the clustering of u and v:

$$d_{u,v}(\pi_i, \pi_j) = \begin{cases}
1, & \text{if } \pi_i(u) = \pi_i(v) \text{ and } \pi_j(u) \neq \pi_j(v) \\
1, & \text{if } \pi_i(u) \neq \pi_i(v) \text{ and } \pi_j(u) = \pi_j(v) \\
0, & \text{otherwise.}
\end{cases} \tag{3}$$

The distance between two clusterings is then defined based on the number of
pairwise disagreements:

$$d(\pi_i, \pi_j) = \frac{1}{2} \sum_{u,v \in X} d_{u,v}(\pi_i, \pi_j) \tag{4}$$

where the factor $\frac{1}{2}$ takes care of double counting and can be ignored. This
measure is defined as the number of pairs of points that are in the same cluster
in one clustering and in different clusters in the other, essentially considering the
(unadjusted) Rand index [9]. Given this measure, a common objective is to find
a consensus clustering $\pi^*$ with respect to the following optimization problem:

$$\min_{\pi^*} \sum_{i=1}^{m} d(\pi_i, \pi^*). \tag{5}$$
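
The partition difference distance of Eqs. (3) and (4) and the objective of Eq. (5) can be written down directly. The sketch below again assumes that a clustering is a list of labels indexed by data point; the function names are hypothetical.

```python
def disagreement(pi_i, pi_j, u, v):
    """Eq. (3): 1 if exactly one of the two clusterings puts u and v together."""
    return int((pi_i[u] == pi_i[v]) != (pi_j[u] == pi_j[v]))

def partition_distance(pi_i, pi_j):
    """Eq. (4): number of unordered point pairs the two clusterings disagree on."""
    n = len(pi_i)
    ordered = sum(disagreement(pi_i, pi_j, u, v)
                  for u in range(n) for v in range(n))
    return ordered / 2  # every unordered pair was counted twice

def total_distance(clusterings, candidate):
    """Objective of Eq. (5) for a candidate consensus clustering."""
    return sum(partition_distance(pi, candidate) for pi in clusterings)
```

For `[0, 0, 1]` and `[0, 1, 1]` the clusterings disagree on the pairs (0, 1) and (1, 2) and agree on (0, 2), so their distance is 2.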


Methods and Algorithms: The two different criteria given above define fundamentally
different optimization problems, thus different algorithms have been
proposed. One key difference between the two approaches inherently lies in determining
the number of clusters $k_{\pi^*}$ in $\pi^*$. The pairwise similarity approaches (e.g.,
Eq. (2)) require an input parameter K that fixes the number of clusters in $\pi^*$,
whereas the partition difference approaches such as Eq. (5) do not have this
requirement and determining $k_{\pi^*}$ is part of the objective of the problem. Therefore,
for example, Eq. (2) will have a minimum value in the case when $k_{\pi^*} = n$;
however, this does not hold for Eq. (5).

The Cluster-based Similarity Partitioning Algorithm (CSPA) is proposed in 
[24] for solving the pairwise similarity based approach. The CSPA constructs a 
similarity-based graph with each edge having a weight proportional to the simi- 
larity given by S. Determining the consensus clustering with exactly K clusters 
is treated as a K-way graph partitioning problem, which is solved by methods 
such as METIS [12]. In [20], the authors experiment with different clustering 
algorithms including hierarchical agglomerative clustering (HAC) and iterative 
techniques that start from an initial partition and iteratively reassign points to 
clusters based on their pairwise similarities. For the partition difference app- 
roach, Li et al. [15] proposed to solve Eq. (5) using nonnegative matrix factor- 
ization (NMF). Gionis et al. [10] proposed several algorithms that make use of 
the connection between Eq. (5) and the problem of correlation clustering. CSPA, 
HAC, NMF: these three approaches are considered as baseline in our empirical 
evaluation section (Sect. 4). 


2.3 Ising Models 


Ising models are graphical models that include a set of nodes representing spin
variables and a set of edges corresponding to the interactions between the spins.
The energy level of an Ising model which we aim to minimize is given by:

$$E(\sigma) = \sum_{(i,j) \in \mathcal{E}} J_{i,j} \sigma_i \sigma_j + \sum_{i \in \mathcal{N}} h_i \sigma_i, \tag{6}$$

where the variables $\sigma_i \in \{-1, 1\}$ are the spin variables and the couplers, $J_{i,j}$,
represent the interaction between the spins.

A Quadratic Unconstrained Binary Optimization (QUBO) model includes
binary variables $q_i \in \{0, 1\}$ and couplers, $c_{i,j}$. The objective to minimize is:

$$E(q) = \sum_{i=1}^{n} c_i q_i + \sum_{i<j} c_{i,j} q_i q_j. \tag{7}$$

QUBO models can be transformed to Ising models by setting $\sigma_i = 2q_i - 1$ [2].
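
The transformation can be made explicit by substituting $q_i = (\sigma_i + 1)/2$ into Eq. (7): each coupler $c_{i,j}$ contributes $c_{i,j}/4$ to $J_{i,j}$, to $h_i$, to $h_j$ and to a constant offset, and each linear term $c_i$ contributes $c_i/2$ to $h_i$ and to the offset. The following sketch implements this substitution; the function names and the dictionary representation of the couplers are assumptions for illustration only.

```python
def qubo_to_ising(linear, quadratic):
    """Convert QUBO coefficients to Ising coefficients via sigma = 2q - 1.

    linear    : list of the coefficients c_i
    quadratic : dict mapping (i, j) with i < j to the coupler c_ij
    Returns (h, J, offset) such that the QUBO energy of q equals the
    Ising energy of sigma = 2q - 1 plus the constant offset.
    """
    h = [c / 2.0 for c in linear]
    J = {}
    offset = sum(linear) / 2.0
    for (i, j), c in quadratic.items():
        J[(i, j)] = c / 4.0
        h[i] += c / 4.0
        h[j] += c / 4.0
        offset += c / 4.0
    return h, J, offset

def qubo_energy(q, linear, quadratic):
    """Energy of Eq. (7) for a 0/1 assignment q."""
    return (sum(c * q[i] for i, c in enumerate(linear))
            + sum(c * q[i] * q[j] for (i, j), c in quadratic.items()))

def ising_energy(s, h, J, offset=0.0):
    """Energy of Eq. (6) for a -1/+1 assignment s, plus a constant offset."""
    return (sum(c * s[i] * s[j] for (i, j), c in J.items())
            + sum(hi * s[i] for i, hi in enumerate(h)) + offset)
```

The offset does not change which assignment is optimal, so it can be dropped when only the minimizer is of interest.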


3 Ising Approach for Consensus Clustering on Specialized 
Hardware 


In this section, we present our approach for solving consensus clustering on 
specialized hardware using Ising models. We present two Ising models that cor- 
respond to the two approaches in Sect. 2.2. We then demonstrate how they can 
be solved on the Fujitsu Digital Annealer (DA), a specialized CMOS hardware.


3.1 Pairwise Similarity-Based Ising Model


For each data point $u \in X$, let $q_{uc} \in \{0, 1\}$ be the binary variable such that
$q_{uc} = 1$ if $\pi^*$ assigns u to cluster c, and 0 otherwise. Then the constraints

$$\sum_{c=1}^{K} q_{uc} = 1, \quad \text{for each } u \in X \tag{8}$$

ensure $\pi^*$ assigns each point to exactly one cluster. Subject to the constraints
(8), the sum of quadratic terms $\sum_{c=1}^{K} q_{uc} q_{vc}$ is 1 if $\pi^*$ assigns both $u, v \in X$ to
the same cluster, and is 0 if assigned to different clusters. Therefore the value

$$\sum_{\substack{u,v \in X: \\ \pi^*(u) = \pi^*(v)}} (1 - S_{uv}) = \sum_{u,v \in X} (1 - S_{uv}) \sum_{c=1}^{K} q_{uc} q_{vc} \tag{9}$$

represents the sum of within-cluster dissimilarities in $\pi^*$: $(1 - S_{uv})$ is the fraction
of clusterings in $\Pi$ that assign u and v to different clusters while $\pi^*$ assigns them
to the same cluster. We therefore reformulate Eq. (2) as QUBO:

$$\min \sum_{u,v \in X} (1 - S_{uv}) \sum_{c=1}^{K} q_{uc} q_{vc} + \sum_{u \in X} A \left( \sum_{c=1}^{K} q_{uc} - 1 \right)^2 \tag{10}$$

where the term $\sum_{u \in X} A \left( \sum_{c=1}^{K} q_{uc} - 1 \right)^2$ is added to the objective function to
ensure that the constraints (8) are satisfied. A is a positive constant that penalizes
the objective for violations of constraints (8). One can show that if $A > n$, the
optimal solution of the QUBO in Eq. (10) does not violate the constraints (8).
The proof is very similar to the proof of Theorem 1 and a similar result in [14].
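
For a very small instance, the QUBO of Eq. (10) can be checked by exhaustive enumeration. The sketch below evaluates the objective for a 0/1 matrix q (rows are data points, columns are clusters) and searches all assignments; it only illustrates the formulation and has nothing to do with how the Digital Annealer solves the model. All names are hypothetical.

```python
from itertools import product

def similarity_qubo_energy(q, S, A):
    """Objective of Eq. (10) for a 0/1 assignment matrix q (n rows, K columns)."""
    n, K = len(q), len(q[0])
    cost = 0.0
    for u in range(n):
        for v in range(n):
            together = sum(q[u][c] * q[v][c] for c in range(K))
            cost += (1.0 - S[u][v]) * together
        cost += A * (sum(q[u]) - 1) ** 2  # penalty for violating constraint (8)
    return cost

def brute_force_minimum(S, K, A):
    """Enumerate all 2^(n*K) assignments; feasible only for tiny instances."""
    n = len(S)
    best = None
    for bits in product((0, 1), repeat=n * K):
        q = [bits[u * K:(u + 1) * K] for u in range(n)]
        energy = similarity_qubo_energy(q, S, A)
        if best is None or energy < best[0]:
            best = (energy, q)
    return best
```

With three points, K = 2 and A = 4 (larger than n = 3), every assignment that violates a constraint costs at least 4, so the minimizer found by enumeration is one-hot in every row.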


3.2 Partition Difference Ising Model 


The partition difference approach essentially considers the (unadjusted) Rand
Index [9] and therefore can be expected to perform better. The Correlation
Clustering Problem is another important problem in data mining. Gionis et
al. [10] showed that Eq. (5) is a restricted case of the Correlation Clustering
Problem, and that Eq. (5) can be expressed as the following equivalent form of
the Correlation Clustering Problem

$$\min_{\pi^*} \sum_{\substack{u,v \in X: \\ \pi^*(u) = \pi^*(v)}} (1 - S_{uv}) + \sum_{\substack{u,v \in X: \\ \pi^*(u) \neq \pi^*(v)}} S_{uv}. \tag{11}$$

We take advantage of this equivalence to model Eq. (5) as a QUBO. In a similar
fashion to the QUBO formulated in the preceding subsection, the terms

$$\sum_{\substack{u,v \in X: \\ \pi^*(u) \neq \pi^*(v)}} S_{uv} = \sum_{u,v \in X} S_{uv} \sum_{1 \le c \neq l \le K} q_{uc} q_{vl} \tag{12}$$

measure the similarity between points in different clusters, where K represents
an upper bound for the number of clusters in $\pi^*$. This then leads to minimizing
the following QUBO:

$$\sum_{u,v \in X} (1 - S_{uv}) \sum_{c=1}^{K} q_{uc} q_{vc} + \sum_{u,v \in X} S_{uv} \sum_{1 \le c \neq l \le K} q_{uc} q_{vl} + \sum_{u \in X} B \left( \sum_{c=1}^{K} q_{uc} - 1 \right)^2. \tag{13}$$

Intuitively, Eq. (13) measures the disagreement between the consensus clustering
and the clusterings in $\Pi$. This disagreement is due to points that are
clustered together in the consensus clustering but not in the clusterings in $\Pi$;
however, it is also due to points that are assigned to different clusters in the
consensus partition but to the same cluster in some of the partitions in $\Pi$.

Formally, we can show that Eq. (13) is equivalent to the correlation clustering
formulation in Eq. (11) when setting $B > n$. Consistent with other methods that
optimize Eq. (5) (e.g., [15]), our approach takes as an input K, an upper bound
on the number of clusters in $\pi^*$; however, the obtained solution can use a smaller
number of clusters. In our proof, we assume K is large enough to represent the
optimal solution, i.e., greater than the number of clusters in optimal solutions
to the correlation clustering problem in Eq. (11).


Theorem 1. Let q be the optimal solution to the QUBO given by Eq. (13).
If $B > n$, for a large enough $K < n$, an optimal solution to the Correlation
Clustering Problem in Eq. (11), $\pi$, can be efficiently evaluated from q.


Proof. First we show the optimal solution to the QUBO in Eq. (13) satisfies
the one-hot encoding ($\sum_{k} q_{uk} = 1$). This would imply given q we can create a
valid clustering $\pi$. Note, the optimal solution will never have $\sum_{c} q_{uc} > 1$ as it
can only increase the cost. The only case in which an optimal solution will have
$\sum_{c} q_{uc} < 1$ is when the cost of assigning a point to a cluster is higher than the
cost of not assigning it to a cluster (i.e., the penalty B). Assigning a point u to
a cluster will incur a cost of $(1 - S_{uv})$ for each point v in the same cluster and
$S_{uv}$ for each point v that is not in the cluster. As there are n - 1 additional points
in total, and both $(1 - S_{uv})$ and $S_{uv}$ are less than or equal to one (Eq. (1)), setting
$B > n$ guarantees the optimal solution satisfies the one-hot encoding.

Now we assume that $\pi$ is not optimal, i.e., there exists an optimal solution
$\hat{\pi}$ to Eq. (11) that has a strictly lower cost than $\pi$. Let $\hat{q}$ be the corresponding
QUBO solution to $\hat{\pi}$, such that $\hat{\pi}(u) = k$ if and only if $\hat{q}_{uk} = 1$. This is possible
because K is large enough to accommodate all clusters in $\hat{\pi}$. As both q and $\hat{q}$
satisfy the one-hot encoding (penalty terms are zero), their cost is identical to
the cost of $\pi$ and $\hat{\pi}$. Since the cost of $\hat{\pi}$ is strictly lower than that of $\pi$, and the cost of
q is lower than or equal to that of $\hat{q}$, we have a contradiction.


3.3 Solving Consensus Clustering on the Fujitsu Digital Annealer 


The Fujitsu Digital Annealer (DA) is a recent CMOS hardware platform for solving
combinatorial optimization problems formulated as QUBO [1,8]. We use the second
generation of the DA, which is capable of representing problems with up to 8192
variables with up to 64 bits of precision. The DA has previously been used to
solve problems in areas such as communication [18] and signal processing [21].
The DA algorithm [1] is based on simulated annealing (SA) [13], while taking
advantage of the massive parallelization provided by the CMOS hardware [1]. It
has several key differences compared to SA, most notably a parallel-trial scheme
in which each Monte Carlo (MC) step considers all possible one-bit flips in
parallel, and a dynamic offset mechanism that increases the energy of a state to
escape local minima [1].


Encoding Consensus Clustering on the DA. When embedding our Ising 
models on the DA, we need to consider the hardware specification and adapt the 
representation of our model accordingly. Due to the hardware precision limit, we
need to embed the couplers and biases on an integer scale with limited
granularity. In our experiments, we normalize the pairwise costs $S_{uv}$ to the
discrete range [0, 100], $D_{uv} = [S_{uv} \cdot 100]$, and accordingly
$(1 - S_{uv})$ is replaced by $(100 - D_{uv})$. Note that the theoretical bound
B = n is adjusted accordingly to be B = 100·n.

The theoretical bound guarantees that all constraints are satisfied if problems 
are solved to optimality. In practice, the DA does not necessarily solve problems 
to optimality and due to the nature of annealing-based algorithms, using very 
high weights for constraints is likely to create deep local minima and result 
in solutions that may satisfy the constraints but are often of low-quality. This 
is especially relevant to our pairwise similarity model where the bound tends 
to become loose as the number of clusters grows. In our experiments, we use 
constant, reasonably high, weights that were empirically found to perform well 
across datasets. For the pairwise similarity-based model (Eq. (10)) we use A =
$2^{14}$, and for the partition difference model (Eq. (13)) we use B = 21°. While we
expect to get better performance by tuning the weights per-dataset, our goal is 
to demonstrate the performance of our approach in a general setting. Automatic 
tuning of the weight values for the DA is a direction for future work. 

Unlike many of the existing consensus clustering algorithms that run until 
convergence, our method runs for a given time limit (defined by the number of 
runs and iterations) and returns the best solution encountered. In our experi- 
ments, we arbitrarily choose three seconds as a (reasonably short) time limit to 
solve our Ising models. As with the weights, we employ a single temperature 
schedule across all datasets, and do not tune it per dataset. 


4 Empirical Evaluation 


We perform an extensive empirical evaluation of our approach using a set of seven 
benchmark datasets. We first describe how we generate the set of clusterings,
$\Pi$. Next, we describe the baselines, the evaluation metrics, and the datasets.


Generating Partitions. We follow [7] and generate a set of clusterings by
randomizing the parameters of the K-Means algorithm, namely the number of
clusters and the initial cluster centers. In this work, we only use labelled
datasets, for which we know the number of clusters, K, based on the true labels.
To generate each base clustering, we run the K-Means algorithm with random
cluster centers and a number of clusters chosen randomly from the range [2, 3K].
For each dataset, we generate 100 clusterings to serve as the clustering set $\Pi$.


Baseline Algorithms. We compare our pairwise similarity-based Ising model, 
referred to as DA-Sm, and our correlation clustering Ising model, referred to as 
DA-Cr, to three popular algorithms for consensus clustering: 


1. The cluster-based similarity partitioning algorithm (CSPA) [24] solved as a 
K-way graph partitioning problem using METIS [12]. 

2. The nonnegative matrix factorization (NMF) formulation in [15]. 

3. Hierarchical agglomerative clustering (HAC), which starts with all points in
singleton clusters and repeatedly merges the two clusters with the largest
average similarity based on S, until reaching the desired number of clusters [20].


Evaluation. We evaluate the different methods using three measures. Our main
concern in this work is the level of agreement between the consensus clustering
and the set of input clusterings. To this end, one requires a metric of the
similarity of two clusterings that can be used to measure how close the consensus
clustering $\pi^*$ is to each base clustering $\pi_i \in \Pi$. Two popular
metrics of the similarity between two clusterings are the Rand Index (RI) and
the Adjusted Rand Index (ARI) [11]. The Rand Index of two clusterings lies
between 0 and 1, obtaining the value 1 when both clusterings perfectly agree.
Likewise, the maximum score of ARI, which is a corrected-for-chance version of
RI, is achieved when both clusterings perfectly agree. ARI($\pi_i$, $\pi^*$) can
be viewed as a measure of agreement between the consensus clustering $\pi^*$ and
a base clustering $\pi_i \in \Pi$. We use the mean ARI as the main evaluation
criterion:


\[
\text{Mean ARI} = \frac{1}{|\Pi|} \sum_{i=1}^{|\Pi|} \mathrm{ARI}(\pi_i, \pi^*)
\tag{14}
\]
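
As an illustration of this criterion, the sketch below computes the ARI from the contingency counts of two label vectors and averages it over a set of base clusterings. It is a plain-Python sketch, not the implementation used in the experiments; the function names and the toy label vectors are hypothetical, and at least two data points are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """ARI of two clusterings given as equal-length label sequences."""
    n = len(a)
    # Pairs of points that fall in the same cell of the contingency table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are trivial
    return (sum_cells - expected) / (max_index - expected)


def mean_ari(base_clusterings, consensus):
    """Average ARI between the consensus and every base clustering."""
    scores = [adjusted_rand_index(p, consensus) for p in base_clusterings]
    return sum(scores) / len(scores)


# Toy usage (hypothetical labels): the label names themselves do not matter.
consensus = [0, 0, 1, 1]
base = [[1, 1, 0, 0], [0, 1, 0, 1]]
score = mean_ari(base, consensus)
```

The first base clustering is identical to the consensus up to relabelling, while the second agrees with it less than chance would, so the mean lies between the two individual scores.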


We also evaluate $\pi^*$ based on clustering quality and accuracy. For clustering
quality, we use the mean Silhouette Coefficient [22] of all data points (computed
using the Euclidean distance between the data points). For clustering accuracy,
we compute the ARI between the consensus partition $\pi^*$ and the true labels.


Benchmark Datasets. We run experiments on seven datasets with different
characteristics: Iris, Optdigits, Pendigits, Seeds, and Wine from the UCI
repository [5], as well as Protein [29] and MNIST.¹ Optdigits-389 is a randomly
sampled subset of Optdigits containing only the digits {3, 8, 9}. Similarly,
MNIST-3689 and Pendigits-149 are subsets of the MNIST and Pendigits datasets.


¹ http://yann.lecun.com/exdb/mnist/.


114 E. Cohen et al. 


Table 1 provides statistics on each dataset, with the coefficient of variation
(CV) [4] describing the degree of class imbalance: zero indicates perfectly
balanced classes, while higher values indicate a higher degree of class imbalance.


Table 1. Datasets


Dataset       | # Instances | # Features | # Clusters | CV
Iris          | 150         | 4          | 3          | 0.000
MNIST-3689    | 389         | 784        | 4          | 0.015
Optdigits-389 | 537         | 64         | 3          | 0.021
Pendigits-149 | 532         | 16         | 3          | 0.059
Protein       | 116         | 20         | 6          | 0.301
Seeds         | 210         | 7          | 3          | 0.000
Wine          | 178         | 13         | 3          | 0.158

4.1 Results 


We compare the baseline algorithms to the two Ising models in Sect. 3 solved 
using the Fujitsu Digital Annealer described in Sect. 3.3. 

Clustering is typically an unsupervised task and the number of clusters is 
unknown. The number of clusters in the true labels, K, is not available in real 
scenarios. Furthermore, K is not necessarily the best value for clustering tasks 
(e.g., in many cases it is better to have smaller clusters that are more pure). We 
therefore test the algorithms in two configurations: when the number of clusters 
is set to K, as in the true labels, and when the number of clusters is set to 2K. 


Table 2. Consensus performance measured by mean ARI across partitions


        | K clusters                            | 2K clusters
Dataset | CSPA  | NMF   | HAC   | DA-Sm | DA-Cr | CSPA  | NMF   | HAC   | DA-Sm | DA-Cr
Iris    | 0.555 | 0.618 | 0.618 | 0.619 | 0.621 | 0.536 | 0.614 | 0.627 | 0.608 | 0.642
MNIST   | 0.459 | 0.449 | 0.469 | 0.474 | 0.474 | 0.456 | 0.511 | 0.517 | 0.490 | 0.521
Optdig. | 0.528 | 0.550 | 0.541 | 0.550 | 0.551 | 0.492 | 0.596 | 0.608 | 0.576 | 0.612
Pendig. | 0.546 | 0.546 | 0.507 | 0.555 | 0.555 | 0.531 | 0.629 | 0.642 | 0.605 | 0.644
Protein | 0.344 | 0.393 | 0.379 | 0.390 | 0.405 | 0.324 | 0.419 | 0.423 | 0.378 | 0.415
Seeds   | 0.558 | 0.577 | 0.534 | 0.575 | 0.577 | 0.484 | 0.602 | 0.602 | 0.580 | 0.612
Wine    | 0.481 | 0.536 | 0.535 | 0.537 | 0.538 | 0.502 | 0.641 | 0.641 | 0.641 | 0.643
# Best  | 0     | 4     | 1     | 6     | 7     | 0     | 1     | 3     | 1     | 6

Consensus Criteria. Table 2 shows the mean ARI between $\pi^*$ and the
clusterings in $\Pi$. To avoid bias due to very minor differences, we consider
all methods that achieve a mean ARI within a threshold of 0.0025 of the best
method to be equivalent and highlight them in bold. We also summarize the
number of times each method was considered best across the different datasets.
The results show that DA-Cr is the best performing method for both K and
2K clusters. The results of DA-Sm are not consistent: DA-Sm and NMF perform
well for K clusters, while HAC performs better for 2K clusters.
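
The "# Best" rows can be derived mechanically from per-dataset scores with the equivalence threshold described above. The following sketch shows one way to do this; the function name and the scores are hypothetical and are not taken from the tables.

```python
def count_best(scores, threshold=0.0025):
    """Count, per method, the datasets on which it is within `threshold`
    of the best score. `scores` maps dataset -> {method: score}."""
    wins = {}
    for per_method in scores.values():
        best = max(per_method.values())
        for method, value in per_method.items():
            wins.setdefault(method, 0)
            if best - value <= threshold:
                wins[method] += 1  # treated as equivalent to the best
    return wins


# Hypothetical scores for two datasets and three methods.
example_scores = {
    "A": {"m1": 0.600, "m2": 0.602, "m3": 0.550},
    "B": {"m1": 0.400, "m2": 0.450, "m3": 0.449},
}
best_counts = count_best(example_scores)
```

Note that applying such a rule to rounded table entries may give slightly different counts than applying it to unrounded scores.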


Clustering Quality. Table 3 reports the mean Silhouette Coefficient of all data
points. Again, DA-Cr is the best performing method across datasets, followed 
by HAC. NMF seems to be equivalent to HAC for 2K. 


Table 3. Clustering quality measured by Silhouette


        | K clusters                            | 2K clusters
Dataset | CSPA  | NMF   | HAC   | DA-Sm | DA-Cr | CSPA  | NMF   | HAC   | DA-Sm | DA-Cr
Iris    | 0.519 | 0.555 | 0.555 | 0.551 | 0.553 | 0.289 | 0.366 | 0.371 | 0.343 | 0.373
MNIST   | 0.075 | 0.072 | 0.078 | 0.079 | 0.078 | 0.069 | 0.082 | 0.074 | 0.074 | 0.082
Optdig. | 0.127 | 0.120 | 0.120 | 0.130 | 0.130 | 0.088 | 0.119 | 0.119 | 0.112 | 0.121
Pendig. | 0.307 | 0.307 | 0.315 | 0.310 | 0.310 | 0.305 | 0.332 | 0.375 | 0.368 | 0.364
Protein | 0.074 | 0.106 | 0.095 | 0.094 | 0.104 | 0.068 | 0.111 | 0.115 | 0.119 | 0.118
Seeds   | 0.461 | 0.468 | 0.410 | 0.469 | 0.472 | 0.275 | 0.343 | 0.304 | 0.344 | 0.302
Wine    | 0.453 | 0.542 | 0.571 | 0.547 | 0.545 | 0.452 | 0.543 | 0.541 | 0.539 | 0.542
# Best  | 0     | 2     | 4     | 2     | 5     | 0     | 4     | 4     | 2     | 5


Clustering Accuracy. Table 4 shows the clustering accuracy measured by the
ARI between $\pi^*$ and the true labels. For K, we find DA-Sm to be the
best-performing solution (followed by DA-Cr). For 2K, DA-Cr outperforms the
other methods. Interestingly, there is no clear winner between CSPA, NMF, and HAC.


Experiments with Higher K. In partition difference approaches, increasing
K does not necessarily lead to a $\pi^*$ that has more clusters. Instead, K
serves as an upper bound, and new clusters are used only if they reduce the
objective.

To demonstrate how different algorithms handle different K values, Table 5
shows the consensus criteria and the actual number of clusters in $\pi^*$ for
different values of K (note that K = 3 in Iris). The results show that the
performance of the pairwise similarity methods (CSPA, HAC, DA-Sm) degrades as we
increase K. This is associated with the fact that the actual number of clusters
in $\pi^*$ is equal to K, which is significantly higher than in the clusterings
in $\Pi$. Methods based on partition difference (NMF and DA-Cr) do not exhibit
significant degradation, and the actual number of clusters does not grow beyond
5 for DA-Cr and 6 for NMF. Note that the average number of clusters in $\Pi$ is
5.26.




Table 4. Clustering accuracy measured by ARI compared to true labels


        | K clusters                            | 2K clusters
Dataset | CSPA  | NMF   | HAC   | DA-Sm | DA-Cr | CSPA  | NMF   | HAC   | DA-Sm | DA-Cr
Iris    | 0.868 | 0.746 | 0.746 | 0.716 | 0.730 | 0.438 | 0.463 | 0.447 | 0.433 | 0.521
MNIST   | 0.684 | 0.518 | 0.704 | 0.730 | 0.720 | 0.412 | 0.484 | 0.545 | 0.440 | 0.484
Optdig. | 0.712 | 0.642 | 0.675 | 0.734 | 0.738 | 0.380 | 0.513 | 0.630 | 0.481 | 0.623
Pendig. | 0.674 | 0.679 | 0.499 | 0.668 | 0.668 | 0.398 | 0.614 | 0.625 | 0.490 | 0.639
Protein | 0.365 | 0.298 | 0.363 | 0.349 | 0.376 | 0.237 | 0.332 | 0.301 | 0.308 | 0.345
Seeds   | 0.705 | 0.710 | 0.704 | 0.764 | 0.717 | 0.424 | 0.583 | 0.573 | 0.500 | 0.619
Wine    | 0.324 | 0.395 | 0.371 | 0.402 | 0.398 | 0.231 | 0.245 | 0.240 | 0.248 | 0.238
# Best  | 1     | 1     | 0     | 3     | 2     | 0     | 0     | 2     | 1     | 4


Table 5. Results for Iris dataset with different number of clusters


   | Consensus Criteria                    | # of clusters in consensus clustering
K  | CSPA  | NMF   | HAC   | DA-Sm | DA-Cr | CSPA | NMF | HAC | DA-Sm | DA-Cr
3  | 0.555 | 0.618 | 0.618 | 0.619 | 0.621 | 3    | ?   | 3   | 3     | 3
6  | 0.536 | 0.614 | 0.627 | 0.608 | 0.642 | 6    | ?   | 6   | 6     | 5
9  | 0.447 | 0.614 | 0.591 | 0.497 | 0.642 | 9    | ?   | 9   | 9     | 5
12 | 0.370 | 0.614 | 0.507 | 0.414 | 0.642 | 12   | ?   | 12  | 12    | 5


5 Conclusion 


Motivated by the recent emergence of specialized hardware platforms, we present 
a new approach to the consensus clustering problem that is based on Ising models 
and solved on the Fujitsu Digital Annealer, a specialized CMOS hardware. We 
perform an extensive empirical evaluation and show that our approach outperforms
existing methods on a set of seven datasets. These results show that using
specialized hardware in core data mining tasks can be a promising research
direction. As future work, we plan to investigate additional problems in data
mining that can benefit from the use of specialized optimization hardware, as
well as to experiment with different types of specialized hardware platforms.


Abstract. Using transfer learning to help solve a new classification
task where labeled data is scarce is becoming popular. Numerous experiments
with deep neural networks, where the representation learned on
a source task is transferred to learn a target neural network, have shown
the benefits of the approach. This paper similarly deals with hypothesis
transfer learning. However, it presents a new approach where, instead of
transferring a representation, the source hypothesis is kept and it is a
translation from the target domain to the source domain that is learned.
In a way, a change of representation is learned. We show that this method
performs very well on a time series classification task where the space
of time series is changed between source and target.


Keywords: Transfer learning · Boosting


1 Introduction 


While transfer learning has a long history, dating back at least to the study of 
analogy reasoning, it has enjoyed a spectacular rise of interest in recent years, 
thanks largely to its use and effectiveness in learning new tasks with deep neural 
networks using an architecture learned on a source task. This approach is called 
Hypothesis Transfer Learning [6]. The justification for this strategy is that, in the 
absence of enough data in the target domain to learn anew a good hypothesis, 
it might be effective to transfer the intermediate representations learned on the 
source task. This is indeed the case, for instance, in face analysis when the 
source task is to guess the age of the person, and the target task is to recognize 
the gender. Technically, with neural networks, this amounts to keeping the first 
layers of the source neural network in the target network and learning only the 
last layers, the ones that combine intermediate representations of the examples 
in order to make a prediction. 

Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be the input, output and
feature spaces respectively. Let $\mathcal{F}$ be a class of representation
functions, where $f \in \mathcal{F} : \mathcal{X} \to \mathcal{Z}$. Let
$\mathcal{G}$ be a class of decision functions that use descriptions of the
examples in the feature space: $g \in \mathcal{G} : \mathcal{Z} \to \mathcal{Y}$.
Then, in the context of deep neural networks, the hypothesis class is
$\mathcal{H} := \{h : \exists f \in \mathcal{F}, g \in \mathcal{G} \text{ s.t. } h = g \circ f\}$
and, while f is kept (at least approximately) from the source problem to the
target one, only g remains to be learned to solve the target problem.

In this paper, we adopt a dual perspective: we propose to keep the decision
function g fixed, and learn translation functions from the target input space to
the source input space, $\pi : \mathcal{X}_T \to \mathcal{X}_S$, such that the
target hypothesis space becomes
$\mathcal{H}_T := \{h_T : \exists \pi \in \Pi, f \in \mathcal{F}, g \in \mathcal{G} \text{ s.t. } h_T = g \circ f \circ \pi\}$,
which, given that $h_S = g \circ f$ might be considered as the source hypothesis,
may be re-expressed as:
$\mathcal{H}_T := \{h_T : \exists \pi \in \Pi, f \in \mathcal{F}, g \in \mathcal{G} \text{ s.t. } h_T = h_S \circ \pi\}$.

Indeed, for some problems, it might be much easier to learn translations
(also called projections in this paper) from the target input space
$\mathcal{X}_T$ to the source input space $\mathcal{X}_S$ than to learn a new
target decision function. Furthermore, this allows one to tackle problems with
different input spaces $\mathcal{X}_S$ and $\mathcal{X}_T$.

In the following, Sect. 2 presents TransBoost, a new algorithm for transfer
learning. The theoretical analysis of Sect. 3 provides a PAC-learning bound
on the generalization error on the target domain. Controlled experiments are
described in Sect. 4 together with an analysis of the results. The new approach
is put in perspective in Sect. 5 before we conclude in Sect. 6.


2 A New Algorithm for Transfer Learning 


Suppose that we have a system that is able to recognize poppy fields in satellite
images. We might imagine that, knowing how to translate a biopsy image into
a satellite image, we could, using the recognition function defined on satellite
images, decide whether there are cancerous cells in the biopsy.

Ideally then, one could translate a target query, “what is the label of
$x^T \in \mathcal{X}_T$?”, into a source query, “what is the label of
$\pi(x^T) \in \mathcal{X}_S$?”, where $h_S$ is the source hypothesis which,
applied to $\pi(x^T) \in \mathcal{X}_S$, provides the answer we are looking for.
Notice here that we suppose that $\mathcal{Y}_S = \mathcal{Y}_T$, but not that
$\mathcal{X}_S = \mathcal{X}_T$.

The goal is then to learn a good translation $\pi : \mathcal{X}_T \to \mathcal{X}_S$.
However, defining a proper space of candidate projections $\Pi$ might be
problematic, not to mention the risk of overfitting if the space of functions
$h_S \circ \Pi$ has too high a capacity. It might be easier and more manageable
to discover “weak projections” from $\mathcal{X}_T$ to $\mathcal{X}_S$ using a
boosting learning scheme.


Definition 1. A weak projection w.r.t. source decision function $h_S$ is a
function $\pi : \mathcal{X}_T \to \mathcal{X}_S$ such that the decision function
$h_S(\pi(x^T))$ has better than random classification performance on the target
training set $S_T$.


In this setting, the training set $S_T = \{(x_i^T, y_i^T)\}_{1 \le i \le m}$ is
used to learn weak projections (Fig. 1).

Fig. 1. The principle of prediction using TransBoost. A given target example
$x^T$ is projected in the source domain using a set of identified weak
projections $\pi_n$, and the prediction for $x^T$ is computed as:
$H_T(x^T) = \mathrm{sign}\{\sum_n \alpha_n h_S(\pi_n(x^T))\}$.


Once the concept of weak projection is assumed, it is natural to use a boosting
algorithm in order to learn a set of such weak projections and to combine
them to get a final good classification on elements of $\mathcal{X}_T$. This is
what the TransBoost algorithm does (see Algorithm 1). It relies on the property
of the boosting algorithm to find and combine weak rules to get a strong(er) rule.


3 Theoretical Analysis 


Here, we study the question: can we get guarantees about the performance of
the learned decision function $H_T$ in the target space using TransBoost?

We tackle this question in two steps. First, we suppose that we learn a single
projection function $\pi \in \Pi : \mathcal{X}_T \to \mathcal{X}_S$ so that
$h_T = h_S \circ \pi$, and we find bounds on the generalization error on the
target domain given the generalization error on the source domain. Second, we
turn to the TransBoost algorithm in order to justify the use of a boosting
approach.


3.1 Generalization Error Bounds When Using a Single Projection 


For this analysis, we suppose the existence of a source input distribution
$P_{\mathcal{X}_S}$, in addition to the target input distribution
$P_{\mathcal{X}_T}$. We consider the binary classification setting
$\mathcal{Y} = \{-1, +1\}$, and we denote by $h_S$ and $h_T$ the source and the
target labelling functions, respectively. We denote by $R_S(h)$ (resp. $R_T(h)$)
the risk of a hypothesis h on the source (resp. target) domain:
$R_S(h) = \mathbb{E}_{x^S \sim P_{\mathcal{X}_S}}[h(x^S) \ne h_S(x^S)]$
(resp. $R_T(h) = \mathbb{E}_{x^T \sim P_{\mathcal{X}_T}}[h(x^T) \ne h_T(x^T)]$).
Let $\hat{R}_S(h)$ and $\hat{R}_T(h)$ be the corresponding empirical risks, with
$m_S$ training points for S and $m_T$ training points for T. Let
$d_{\mathcal{H}}$ be the VC dimension of the hypothesis space $\mathcal{H}$.

In the following, what is learned is a projection
$\pi \in \Pi : \mathcal{X}_T \to \mathcal{X}_S$ in order to get a target
hypothesis of the form $\hat{h}_T = \hat{h}_S \circ \pi$, where
$\hat{h}_S = \mathrm{ArgMin}_{h \in \mathcal{H}_S} \hat{R}_S(h)$ is the source
hypothesis. Our aim is to upper-bound $R_T(\hat{h}_T)$, the risk of the
learned hypothesis on the target domain, in terms of:
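

Algorithm 1. Transfer learning by boosting

Input: $h_S : \mathcal{X}_S \to \mathcal{Y}_S$, the source hypothesis
       $S_T = \{(x_i^T, y_i^T)\}_{1 \le i \le m}$, the target training set

Initialization of the distribution on the training set: $D_1(i) = 1/m$ for
$i = 1, \ldots, m$;

for $n = 1, \ldots, N$ do
    Find a projection $\pi_n : \mathcal{X}_T \to \mathcal{X}_S$ s.t.
    $h_S(\pi_n(\cdot))$ performs better than random on $D_n(S_T)$;
    Let $\varepsilon_n$ be the error rate of $h_S(\pi_n(\cdot))$ on $D_n(S_T)$:
    $\varepsilon_n = P_{i \sim D_n}[h_S(\pi_n(x_i^T)) \ne y_i^T]$ (with $\varepsilon_n < 0.5$);
    Compute $\alpha_n = \frac{1}{2} \log_2\left(\frac{1 - \varepsilon_n}{\varepsilon_n}\right)$;
    Update, for $i = 1, \ldots, m$:
    \[
    D_{n+1}(i) = \frac{D_n(i)}{Z_n} \times
    \begin{cases}
    e^{-\alpha_n} & \text{if } h_S(\pi_n(x_i^T)) = y_i^T \\
    e^{\alpha_n} & \text{if } h_S(\pi_n(x_i^T)) \ne y_i^T
    \end{cases}
    = \frac{D_n(i) \exp\left(-\alpha_n \, y_i^T \, h_S(\pi_n(x_i^T))\right)}{Z_n}
    \]
    where $Z_n$ is a normalization factor chosen so that $D_{n+1}$ is a
    distribution on $S_T$;
end

Output: the final target hypothesis $H_T : \mathcal{X}_T \to \mathcal{Y}_T$:
\[
H_T(x^T) = \mathrm{sign}\left\{ \sum_{n=1}^{N} \alpha_n \, h_S(\pi_n(x^T)) \right\}
\tag{1}
\]
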

- the empirical risk $\hat{R}_S(\hat{h}_S)$ of the source hypothesis,

- the generalization error of a hypothesis $h_S$ in $\mathcal{H}_S$ learned from
  $m_S$ examples, which depends on $d_{\mathcal{H}_S}$,

- the generalization error of a hypothesis $\hat{h}_T = \hat{h}_S \circ \pi$ in
  $\mathcal{H}_T$ learned from $m_T$ examples, which depends on
  $d_{\mathcal{H}_T} = d_{\hat{h}_S \circ \Pi}$,

- a term that expresses the “proximity” between the source and the target
  problems.


For the latter term, we adapt the theoretical study of McNamara and Balcan
[9] on the transfer of representation in deep neural networks. We suppose that
$P_S$, $P_T$, $h_S$, $h_T = h_S \circ \pi$ ($\pi \in \Pi$), $\hat{h}_S$ and $\Pi$
have the property:

\[
\forall \hat{h}_S \in \mathcal{H}_S : \min_{\pi \in \Pi} R_T(\hat{h}_S \circ \pi) \le \omega(R_S(\hat{h}_S))
\tag{2}
\]

where $\omega : \mathbb{R} \to \mathbb{R}$ is a non-decreasing function.

Equation (2) means that the best target hypothesis expressed using the
learned source hypothesis has a true risk bounded by a non-decreasing function
of the true risk on the source domain of the learned source hypothesis.

We are now in a position to state the desired theorem.
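

Theorem 1. Let $\omega : \mathbb{R} \to \mathbb{R}$ be a non-decreasing function.
Suppose that $P_S$, $P_T$, $h_S$, $h_T = h_S \circ \pi$ ($\pi \in \Pi$),
$\hat{h}_S$ and $\Pi$ have the property given by Eq. (2). Let
$\hat{\pi} := \mathrm{ArgMin}_{\pi \in \Pi} \hat{R}_T(\hat{h}_S \circ \pi)$ be
the best apparent projection.

Then, with probability at least $1 - \delta$ ($\delta \in (0, 1)$) over pairs of
training sets for tasks S and T:

\[
R_T(\hat{h}_T) \le \omega(\hat{R}_S(\hat{h}_S))
+ 2\sqrt{\frac{2 d_{\mathcal{H}_S} \log(2 e m_S / d_{\mathcal{H}_S}) + 2\log(8/\delta)}{m_S}}
+ 4\sqrt{\frac{2 d_{\hat{h}_S \circ \Pi} \log(2 e m_T / d_{\hat{h}_S \circ \Pi}) + 2\log(8/\delta)}{m_T}}
\]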


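Proof. Let $\pi^* = \mathrm{ArgMin}_{\pi \in \Pi} R_T(\hat{h}_S \circ \pi)$. With
probability at least $1 - \delta$:

\[
\begin{aligned}
R_T(\hat{h}_S \circ \hat{\pi})
&\le \hat{R}_T(\hat{h}_S \circ \hat{\pi}) + 2\sqrt{\frac{2 d_{\hat{h}_S \circ \Pi} \log(2 e m_T / d_{\hat{h}_S \circ \Pi}) + 2\log(8/\delta)}{m_T}} \\
&\le \hat{R}_T(\hat{h}_S \circ \pi^*) + 2\sqrt{\frac{2 d_{\hat{h}_S \circ \Pi} \log(2 e m_T / d_{\hat{h}_S \circ \Pi}) + 2\log(8/\delta)}{m_T}} \\
&\le R_T(\hat{h}_S \circ \pi^*) + 4\sqrt{\frac{2 d_{\hat{h}_S \circ \Pi} \log(2 e m_T / d_{\hat{h}_S \circ \Pi}) + 2\log(8/\delta)}{m_T}} \\
&\le \omega(R_S(\hat{h}_S)) + 4\sqrt{\frac{2 d_{\hat{h}_S \circ \Pi} \log(2 e m_T / d_{\hat{h}_S \circ \Pi}) + 2\log(8/\delta)}{m_T}} \\
&\le \omega(\hat{R}_S(\hat{h}_S)) + 2\sqrt{\frac{2 d_{\mathcal{H}_S} \log(2 e m_S / d_{\mathcal{H}_S}) + 2\log(8/\delta)}{m_S}}
 + 4\sqrt{\frac{2 d_{\hat{h}_S \circ \Pi} \log(2 e m_T / d_{\hat{h}_S \circ \Pi}) + 2\log(8/\delta)}{m_T}}
\end{aligned}
\]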


This follows from the fact [10] (p. 48) that, using m training points and
a hypothesis class of VC dimension d, with probability at least $1 - \delta$,
for all hypotheses h simultaneously, the true risk $R(h)$ and empirical risk
$\hat{R}(h)$ satisfy
$|R(h) - \hat{R}(h)| \le 2\sqrt{\frac{2 d \log(2 e m / d) + 2\log(4/\delta)}{m}}$.
For $\hat{h}_S \circ \Pi$, this yields the first and third inequalities with
probability at least $1 - \delta/2$. For $\mathcal{H}_S$, this yields the fifth
inequality with probability at least $1 - \delta/2$. Applying the union bound
achieves the desired result. The second inequality follows from the definition
of $\hat{\pi}$, and the fourth inequality is where we inject our assumption
about the transferability (or proximity) between the source and the target
problem.

We can thus control the generalization error on the target domain by
controlling $d_{\hat{h}_S \circ \Pi}$, $m_S$ and $\omega$, which measures the
link between the source domain and the target domain. The number of target
training data $m_T$ is typically supposed to be small in transfer learning and
thus cannot be employed to control the error.
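
To get a feel for the orders of magnitude involved, the sketch below evaluates the uniform deviation term and a bound of the form "ω of the empirical source risk, plus a source deviation term, plus twice a target deviation term". This form is our reading of Theorem 1; the function names, the default identity ω, and all numeric inputs are assumptions chosen purely for illustration.

```python
import math


def vc_deviation(d, m, delta):
    """Uniform deviation 2 * sqrt((2 d log(2 e m / d) + 2 log(4 / delta)) / m)."""
    inner = 2 * d * math.log(2 * math.e * m / d) + 2 * math.log(4 / delta)
    return 2.0 * math.sqrt(inner / m)


def target_risk_bound(emp_source_risk, d_source, m_source,
                      d_proj, m_target, delta, omega=lambda r: r):
    """Bound on the target risk; omega defaults to the identity (an assumption)."""
    # Each deviation is taken at confidence delta / 2 (union bound),
    # which turns log(4 / delta) into log(8 / delta).
    source_term = vc_deviation(d_source, m_source, delta / 2)
    target_term = 2.0 * vc_deviation(d_proj, m_target, delta / 2)
    return omega(emp_source_risk) + source_term + target_term


# Hypothetical setting: large source sample, small target sample.
bound_small_target = target_risk_bound(0.05, 50, 10**6, 5, 300, 0.05)
bound_large_target = target_risk_bound(0.05, 50, 10**6, 5, 30000, 0.05)
```

With only a few hundred target examples the target term dominates, which matches the remark that the bound has to be controlled through the capacity of the projection space rather than through the target sample size.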


3.2 Boosting Projections from Target to Source


The above analysis bounds the generalization error of the learned target
hypothesis $\hat{h}_S \circ \hat{\pi}$ in terms, among others, of the VC
dimension of the space $\hat{h}_S \circ \Pi$. The problem of controlling the
capacity of such a space of functions in order to prevent under- or over-fitting
is the same as in the traditional supervised learning setting. The difficulty
lies in choosing the right space $\Pi$ of projection functions from
$\mathcal{X}_T$ to $\mathcal{X}_S$.

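The space of hypothesis functions considered is:

\[
L(\hat{h}_S \circ \Pi_B) = \left\{ x \mapsto \mathrm{sign}\left[ \sum_{n=1}^{N} \alpha_n \, (\hat{h}_S \circ \pi_n)(x) \right] : \forall n,\ \alpha_n \in \mathbb{R} \text{ and } \pi_n \in \Pi_B \right\}
\]

where $\Pi_B$ is a space of weak projections satisfying Definition 1.
Now, from [11] (p. 109), the VC dimension of the space $L(\hat{h}_S \circ \Pi_B)$
satisfies:

\[
d_{L(\hat{h}_S \circ \Pi_B)} \le N \, (d_{\hat{h}_S \circ \Pi_B} + 1) \, \big(3 \log(N (d_{\hat{h}_S \circ \Pi_B} + 1)) + 2\big)
\]

If $d_{\hat{h}_S \circ \Pi_B} \ll d_{\hat{h}_S \circ \Pi}$, then
$d_{L(\hat{h}_S \circ \Pi_B)}$ can also be much less than
$d_{\hat{h}_S \circ \Pi}$, and Theorem 1 provides tighter bounds.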
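
As a small numeric illustration of how this capacity bound grows, the helper below evaluates N(d + 1)(3 log(N(d + 1)) + 2) for a base class of VC dimension d boosted for N rounds. The function name is hypothetical, and the use of the natural logarithm is an assumption, since the base of the logarithm is not specified here.

```python
import math


def boosted_vc_bound(d_base, n_rounds):
    """Upper bound N (d + 1) (3 log(N (d + 1)) + 2) on the VC dimension
    of N-term weighted votes over a base class of VC dimension d."""
    k = n_rounds * (d_base + 1)
    return k * (3 * math.log(k) + 2)


# Hypothetical comparison: a simple base class boosted for 20 rounds.
bound_simple = boosted_vc_bound(d_base=3, n_rounds=20)
bound_more_rounds = boosted_vc_bound(d_base=3, n_rounds=40)
```

The bound grows slightly faster than linearly in both N and d, so a low-capacity space of weak projections keeps the capacity of the boosted class moderate.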

Using the TransBoost method, we can thus gain both on the theoretical 
bounds on the generalization error and on the ease of finding an appropriate 
space of projections Xr > Xs. 


4 Design of the Experiments 


4.1 The Main Dimensions of Experiments in Transfer Learning 


There are two dimensions that can be expected to govern the efficiency of transfer 
learning: 


1. The level of signal in the target data. 
2. The relatedness between the source and the target domains. 


Regarding the first dimension, one can expect that if there is no signal in the 
target data (i.e. the examples are labelled randomly), then no regularity can be 
extracted, directly or using transfer. In fact, only overfitting of the training data 
can potentially occur. If, on the contrary, the target learning task is easy, then 
there cannot be much advantage in using transfer learning. A question therefore 
arises as to whether there might be an optimal level of signal in the target data 
so as to maximally benefit from transfer learning. 

The second dimension is tricky. Here, we intuitively expect that the closer the 
source and target domains (and problems), the more profitable transfer learning 
should be. However, how should we measure the “relatedness” of the source and 
target problems? In the domain adaptation setting, closeness can be measured 
through a measure of the divergence between the source distribution and the 
target one, since they are defined on the same input space. In transfer learning, 
the input spaces can be different, so that it is much more difficult to define a
divergence between distributions. This is why we resorted to the function w in 
our theoretical analysis. In our experiments, we control relatedness through the 
information shared between source and target (see below). 


4.2 Experimental Setup 


In our study, we devised an experimental setup that would allow us to control 
the two dimensions above. 

In the target domain, the learning task is to classify time series of length
$t_T$ into two classes: $h_T : \mathbb{R}^{t_T} \to \{-1, +1\}$. By controlling
the level of noise and the difference between the distributions governing the
two classes, we can control the signal level, that is, the difficulty of
extracting information from the target training data. We control the amount of
information by varying the size $m_T$ of the target training set.

Likewise, the source input space is the space of sequences of real measurements
of length $t_S$. Therefore, we have $h_S : \mathbb{R}^{t_S} \to \{-1, +1\}$.

Varying $|t_S - t_T|$ is a way of controlling the information potentially shared
in the two domains. With $t_S = t_T$, the two input domains are the same.

Note that learning to classify time series is not a trivial task. It has many
applications, some of which involve classifying time series whose length differs
from the length for which a classifier exists.


4.3 Description of the Experiments 


Time series were generated according to the following equation: 


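\[
x_t = \underbrace{t \times \text{slope} \times \text{class}}_{\text{information gain}}
+ \underbrace{x_{max} \sin(\omega_i \times t + \varphi_j)}_{\text{sub shape within class}}
+ \underbrace{\eta(t)}_{\text{noise factor}}
\tag{4}
\]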


The fact that the noise factor is generated according to a Gaussian distribution 
induces a distribution over the data (class € {—1,+1}). 
The level of signal in the training data is governed by: 


1. the slope factor: the higher the value of the slope factor, the easier the dis- 
crimination between the two classes at each additional time step 

2. the number of different shapes in each class of sequences, each shape controlled
by $\omega_i$ and $\varphi_j$, and the importance of this factor in the equation
being weighted by $x_{max}$

3. the noise factor $\eta(t)$

4. the length of the time series, that is the number of measurements 

5. the size of the training set 


In our experiments, the noise factor is generated according to a Gaussian distri- 
bution of mean = 0 and standard deviation in {0.001, 0.002, 0.02, 0.2, 1}. 
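
The generation process can be sketched as follows. This is an illustrative sketch rather than the generator used in the experiments: the function names, the convention that t starts at 0, the value of $x_{max}$, and the (ω, φ) pairs that define the sub-shapes are all assumptions.

```python
import math
import random


def generate_series(length, label, slope, x_max, shapes, noise_std, rng):
    """One series x_t = t*slope*class + x_max*sin(omega*t + phi) + eta(t)."""
    omega, phi = rng.choice(shapes)  # pick one sub-shape of the class
    return [t * slope * label
            + x_max * math.sin(omega * t + phi)
            + rng.gauss(0.0, noise_std)  # Gaussian noise factor eta(t)
            for t in range(length)]


def generate_dataset(n_per_class, length, slope, x_max, shapes_by_class,
                     noise_std, seed=0):
    """Balanced list of (series, label) pairs with labels in {-1, +1}."""
    rng = random.Random(seed)
    data = []
    for label in (-1, +1):
        for _ in range(n_per_class):
            series = generate_series(length, label, slope, x_max,
                                     shapes_by_class[label], noise_std, rng)
            data.append((series, label))
    return data


# Hypothetical configuration: 3 sub-shapes in class +1, 2 in class -1.
shapes_by_class = {
    +1: [(0.10, 0.0), (0.20, 1.0), (0.30, 2.0)],
    -1: [(0.15, 0.0), (0.25, 1.0)],
}
dataset = generate_dataset(450, 200, slope=0.01, x_max=1.0,
                           shapes_by_class=shapes_by_class, noise_std=0.2)
```

Truncating each generated series to its first $t_T$ values would give target-domain inputs of a different length than the source ones, which is one simple way to obtain different input spaces.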

Figure 2 illustrates what can be obtained with slope = 0.01 with 3 subclasses 
in the +1 class, and 2 subclasses in the —1 class. 

Fig. 2. A synthetic data set S with 5 time series where $\eta$ is Gaussian ($\mu = 0$, $\sigma = 0.2$).


In the experiments reported here, we kept the size of the training set constant.
In each experiment, 900 time series of length 200 were generated according to
the equation described above: 450 time series in each class, -1 or +1. We varied
the difficulty of learning by varying the slope from almost nonexistent (0.001)
to significant (0.01). Similarly, we varied the length $t_T$ of the time series
in the target training set in {20, 50, 70, 100}, thus providing increasing
levels of signal.

A target training data set of 300 time series was drawn equally balanced 
between the two classes. Note that this relatively small number corresponds 
to transfer learning scenarios where the training data is limited in the target 
domain. The remaining 600 time series were used as a test set. The source 
hypothesis was learned using the complete time series generated as explained 
above. 

In these experiments, the set of projections $\Pi$ was chosen as a set of “hinge
functions”, defined by three parameters: the slope of the first linear part, the
time t where the hinge takes place, and the slope of the second linear part. The
set is explored randomly by the algorithm, and a projection is retained if its
error rate on the current weighted data is lower than 0.45. We explored other,
richer, spaces of projections without obtaining better performance. This simple
set seems to be sufficient for this learning task.

In order to better assess the value of TransBoost, its performance was compared
(1) to a classifier (Gaussian SVM as implemented in scikit-learn) acting
directly on the target training data, (2) to a boosting algorithm operating in
the target domain with Gaussian SVMs as base classifiers, and (3) to a baseline
transfer learning method that consists in finding a regression from the target
input space to the source input space using support vector regression (SVR). In
this last method, the regression acts as a translation from $\mathcal{X}_T$ to
$\mathcal{X}_S$, and the class of an example $x^T$ is given by
$h_S(\text{regression}(x^T))$.

Table 1 provides representative examples of the results obtained. Each cell of 
the table shows the average performance (and the standard deviations) computed 
from 100 experiments repeated under the same conditions. The experimental 
conditions are organized according to the level of signal in the training data. In 
the experiments corresponding to this table, the source hypotheses were learned 
according to the first protocol defined above. 

Several lessons can be drawn. First of all, in most situations, TransBoost
brings very significant gains over learning without transfer or using transfer
learning with regression. Figures 3 and 4, which sum up a larger set of experimental
conditions, make this even more striking. In both figures, the x-axis reports the
error rate obtained using TransBoost, while the y-axis reports the error rate of
the competing algorithm: either the hypothesis $h_T$ learned on the target training
data alone (Fig. 3), or the hypothesis $H'_T$ learned on the target data projected on
the source input space using an SVR regression (Fig. 4). The remarkable efficiency
of TransBoost in a large spectrum of situations is readily apparent.

Table 1. Comparison of the error rate (lower is better) between: learning directly in
the target domain (columns $h_T$ (train) and $h_T$ (test)), using TransBoost (columns $H_T$
(train) and $H_T$ (test)), learning in the source domain (column $h_S$ (test)) and, finally,
mapping the time series with an SVR regression and using $h_S$ (naive transfer, column
$H'_T$ (test)). Test errors are highlighted in the orange columns. Bold numbers indicate
where TransBoost significantly dominates both learning without transfer and learning
with naive transfer.

slope, noise, $t_T$ | $h_T$ (train) | $h_T$ (test) | $H_T$ (train) | $H_T$ (test) | $h_S$ (test) | $H'_T$ (test)
0.001, 0.001, 20  | 0.46 ± 0.02 | 0.50 ± 0.08 | 0.08 ± 0.03 | 0.08 ± 0.02 | 0.05 | 0.49 ± 0.01
0.005, 0.001, 20  | 0.46 ± 0.02 | 0.49 ± 0.01 | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.01 | 0.45 ± 0.01
0.005, 0.002, 20  | 0.46 ± 0.02 | 0.49 ± 0.03 | 0.03 ± 0.02 | 0.04 ± 0.02 | 0.02 | 0.43 ± 0.01
0.005, 0.02, 20   | 0.44 ± 0.02 | 0.48 ± 0.03 | 0.09 ± 0.01 | 0.10 ± 0.01 | 0.01 | 0.47 ± 0.01
0.001, 0.2, 20    | 0.46 ± 0.02 | 0.50 ± 0.01 | 0.46 ± 0.02 | 0.51 ± 0.02 | 0.11 | 0.49 ± 0.01
0.01, 0.2, 20     | 0.42 ± 0.03 | 0.47 ± 0.03 | 0.34 ± 0.02 | 0.35 ± 0.02 | 0.02 | 0.35 ± 0.01
0.001, 0.001, 50  | 0.46 ± 0.02 | 0.50 ± 0.01 | 0.08 ± 0.03 | 0.08 ± 0.02 | 0.06 | 0.41 ± 0.01
0.005, 0.001, 50  | 0.25 ± 0.07 | 0.28 ± 0.09 | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.01 | 0.28 ± 0.01
0.005, 0.002, 50  | 0.27 ± 0.07 | 0.30 ± 0.08 | 0.02 ± 0.01 | 0.02 ± 0.01 | 0.02 | 0.28 ± 0.01
0.005, 0.02, 50   | 0.26 ± 0.07 | 0.30 ± 0.08 | 0.04 ± 0.01 | 0.04 ± 0.01 | 0.01 | 0.31 ± 0.01
0.001, 0.2, 50    | 0.44 ± 0.02 | 0.50 ± 0.01 | 0.38 ± 0.03 | 0.44 ± 0.02 | 0.15 | 0.43 ± 0.01
0.01, 0.2, 50     | 0.10 ± 0.03 | 0.12 ± 0.04 | 0.10 ± 0.02 | 0.11 ± 0.02 | 0.03 | 0.15 ± 0.02
0.001, 0.001, 100 | 0.43 ± 0.03 | 0.47 ± 0.03 | 0.07 ± 0.02 | 0.07 ± 0.02 | 0.02 | 0.23 ± 0.01
0.005, 0.001, 100 | 0.06 ± 0.03 | 0.07 ± 0.03 | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.01 | 0.07 ± 0.02
0.005, 0.002, 100 | 0.08 ± 0.03 | 0.10 ± 0.04 | 0.02 ± 0.01 | 0.02 ± 0.01 | 0.02 | 0.07 ± 0.01
0.005, 0.02, 100  | 0.08 ± 0.03 | 0.09 ± 0.03 | 0.02 ± 0.01 | 0.03 ± 0.01 | 0.01 | 0.07 ± 0.01
0.001, 0.2, 100   | 0.04 ± 0.03 | 0.46 ± 0.02 | 0.28 ± 0.02 | 0.31 ± 0.01 | 0.16 | 0.31 ± 0.01
0.01, 0.2, 100    | 0.03 ± 0.01 | 0.05 ± 0.02 | 0.04 ± 0.01 | 0.05 ± 0.01 | 0.02 | 0.05 ± 0.01

Secondly, as expected, TransBoost is less dominant either when the data is so
noisy that no method can learn from it (high level of noise or low slope),
which is apparent in the right part of Figs. 3 and 4 (near the diagonal),
or when the task is so easy (large slope and/or low noise) that nothing can be
gained from transfer learning (left part of the two graphs).

We did not report here the results obtained with boosting directly in the
target input space $\mathcal{X}_T$ since the learning performance was almost the same as
that of the SVM classifier. This shows that it is not boosting in itself that
brings a gain.

Fig. 3. Comparison of error rates. y-axis: test error of the SVM classifier
(without transfer). x-axis: test error of the TransBoost classifier with 10
boosting steps. The results of 75 experiments (each one repeated 100 times)
are summed up in this graph.

Fig. 4. Comparison of error rates. y-axis: test error of the “naïve” transfer
method. x-axis: test error of the TransBoost classifier with 10 boosting steps.
The results of 75 experiments (each one repeated 100 times) are summed up in
this graph.


4.4 Additional Experiments 


We show here, in Figs. 5, 6 and 7, qualitative results obtained on the classical
half-moon problem. It is apparent that TransBoost brings satisfying results.

Fig. 5. Experiments on the half-moon problem.


5 Comparison to Previous Works 


In the theoretical analysis of Ben-David et al. [1,2], one central idea is that
a common representation space should be found in which the projections of
the source data $\{(x_i^S)\}_{1 \le i \le m}$ and of the target data $\{(x_i^T)\}_{1 \le i \le m}$ should be as
indistinguishable as possible using discriminative functions from the hypothesis
space $\mathcal{H}$. The intuition is that if the domains become indistinguishable, a
classifier constructed for the source domain should also work for the target domain.
This idea has been at the core of many proposed methods so far [3,5,7,12].

Fig. 6. A KNN model trained on the few target data points (in yellow).
(Color figure online)

Fig. 7. A KNN model transboosted on the few target data points.

In [8], a scenario in which multiple sources are available for a single target
domain is studied. For each source $i \in \{1, \dots, k\}$, the input distribution $D_i$ is
known, as well as a hypothesis $h_i$ with loss bounded by $\epsilon$ on $D_i$. It is further
assumed that the target input distribution is a mixture of the $k$ source
distributions $D_i$. The adaptation problem is thus seen as finding a combination of the
hypotheses $h_i$. It is shown that guarantees on the loss of the combined target
hypothesis can be given for some forms of combinations. However, the authors do
not show how to learn the parameters of these combinations. In [4], the authors
present a system called TrAdaBoost, which uses a boosting scheme to eliminate
data points that seem irrelevant for the new task defined over the same space
$\mathcal{X}$. Despite the use of boosting, the scope is quite different from ours.

Finally, the authors in [6] study a scheme seemingly very close to ours. They 
define Hypothesis Transfer Learning algorithms as algorithms taking as input a 
training set in the target domain and a source hypothesis in the source domain, 
and producing a target hypothesis: 


$$A^{htl} : (\mathcal{X}_T \times \mathcal{Y}_T)^m \times \mathcal{H}_S \to \mathcal{H}_T \subseteq \mathcal{Y}^{\mathcal{X}}$$


One goal of the paper is to identify the effect of the source hypothesis on the
generalization properties of $A^{htl}$. However, the scope of the analysis is limited in
several ways. First, it focusses on linear regression with the Regularized Least
Squares algorithm. Second, the formal framework requires that $\mathcal{X}_T = \mathcal{X}_S$
and $\mathcal{Y}_T = \mathcal{Y}_S$. It is thus more an analysis of domain adaptation than of
transfer learning. Third, the transfer learning algorithm in effect tries to find a
weight vector $w_T$ as close as possible to the source weight vector $w_S$ while fitting
the target data set. There is therefore a parameter to set. More importantly,
the consequence is that the analysis singles out the performance of the source 
hypothesis on the target domain as the most significant factor controlling the 
expected error on the target problem. Again, therefore, the target hypothesis 
cannot be much different from the source one, which seems to defeat the whole 
purpose of transfer learning. 
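
To make the third point concrete, the kind of biased regularized least squares objective usually associated with this setting can be written as below. This is only a sketch in notation introduced here for illustration, not a formula taken from [6]: $m$ is assumed to be the number of target examples $(x_i, y_i)$, and $\lambda$ is assumed to be the parameter that has to be set.

```latex
w_T = \arg\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \left( \langle w, x_i \rangle - y_i \right)^2
      + \lambda \, \lVert w - w_S \rVert^2
```

Under this reading, the larger $\lambda$ is, the closer $w_T$ stays to $w_S$, which is why the target hypothesis cannot differ much from the source one.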


6 Conclusion 


This paper has presented a new transfer learning algorithm, TransBoost, that
uses the boosting mechanism in an original way by selecting and combining weak
projections from the target domain to the source domain. The algorithm inherits
some nice features from boosting. There is only one parameter to set, the number
of boosting steps, and guarantees on the training error and on the test error are
easily derived from the ones obtained in the theory of boosting.

Abstract. A current challenge in graph clustering is to tackle the issue
of complex networks, i.e., graphs with attributed vertices and/or edges. In
this paper, we present GraphTrees, a novel method that relies on random 
decision trees to compute pairwise dissimilarities between vertices in a 
graph. We show that using different types of trees, it is possible to extend 
this framework to graphs where the vertices have attributes. While many 
existing methods that tackle the problem of clustering vertices in an 
attributed graph are limited to categorical attributes, GraphTrees can 
handle heterogeneous types of vertex attributes. Moreover, unlike other 
approaches, the attributes do not need to be preprocessed. We also show 
that our approach is competitive with well-known methods in the case 
of non-attributed graphs in terms of quality of clustering, and provides 
promising results in the case of vertex-attributed graphs. By extending 
the use of an already well established approach — the random trees — to 
graphs, our proposed approach opens new research directions, by lever- 
aging decades of research on this topic. 


Keywords: Graph clustering · Attributed graph · Random tree ·
Dissimilarity · Heterogeneous data


1 Introduction 


Identifying community structure in graphs is a challenging task in many appli- 
cations: computer networks, social networks, etc. Graphs have an expressive 
power that enables an efficient representation of relations between objects as 
well as their properties. Attributed graphs where vertices or edges are endowed 
with a set of attributes are now widely available, many of them being created 
and curated by the semantic web community. While these so-called knowledge 
graphs¹ contain a lot of information, their exploration can be challenging in
practice. In particular, common approaches to find communities in such graphs 
rely on rather complex transformations of the input graph. 


¹ Although many definitions can be found in the literature [9].

In this paper, we propose a decision-tree-based method that we call
GraphTrees (GT) to compute dissimilarities between vertices in a straightforward
manner. The paper is organized as follows. In Sect. 2, we briefly survey related work.
We present our method in Sect. 3, and we discuss its performance in Sect. 4
through an empirical study on real and synthetic datasets. In the last section of
the paper, we present a brief discussion of our results and state some perspectives
for future research.


Main Contributions of the Paper: 


1. We propose a first step to bridge the gap between random decision trees and 
graph clustering and extend it to vertex attributed graphs (Subsect. 4.1). 

2. We show that the vertex-vertex dissimilarity is meaningful and can be used 
for clustering in graphs (Subsect. 4.2). 

3. Our method GT applies directly to the input graph without any preprocessing,
unlike many community detection methods for vertex-attributed graphs, which
rely on a transformation of the input graph.


2 Related Work 


Community detection aims to find highly connected groups of vertices in a graph. 
Numerous methods have been proposed to tackle this problem [1,8,24]. In the 
case of vertex-attributed² graphs, clustering aims at finding homogeneous groups
of vertices sharing (i) common neighbourhoods and structural properties, and (ii) 
common attributes. A vertex-attributed graph is thought of as a finite structure 
G = (V, E, A), where 


– $V = \{v_1, v_2, \dots, v_n\}$ is the set of vertices of G,
– $E \subseteq V \times V$ is the set of edges between the vertices of V, and
– $A = \{x_1, x_2, \dots, x_n\}$ is the set of feature tuples, where each $x_i$ represents the
  attribute value of the vertex $v_i$.


In the case of vertex-attributed graphs, the problem of clustering refers to 
finding communities (i.e., clusters), where vertices in the same cluster are densely
connected, whereas vertices that do not belong to the same cluster are sparsely 
connected. Moreover, as attributes are also taken into account, the vertices in 
the same cluster should be similar w.r.t. attributes. 

In this section, we briefly recall existing approaches to tackle this problem. 


Weight-Based Approaches. The weight-based approach consists in transforming
the attributed graphs into weighted graphs. Standard clustering algorithms
that focus on structural properties can then be applied.

² To avoid terminology-related issues, we will exclusively use the terms vertex for
graphs and node for random trees throughout the paper.

The problem of mapping attribute information into edge weights has been
considered by several authors. Neville et al. define a matching coefficient [20] as
a similarity measure $S$ between two vertices $v_i$ and $v_j$ based on the number of
attribute values the two vertices have in common. The value $S_{v_i, v_j}$ is used as the
edge weight between $v_i$ and $v_j$. Although this approach leads to good results
using Min-Cut [15], MajorClust [26] and spectral clustering [25], only nominal 
attributes can be handled. An extended matching coefficient was proposed in [27] 
to overcome this limitation, based on a combination of normalized dissimilarities 
between continuous attributes and increments of the resulting weight per pair 
of common categorical attributes. 
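
As an illustration of the basic matching coefficient described above, the following minimal sketch counts the nominal attribute values shared by the two endpoints of each edge and uses that count as the edge weight. The function names and the toy data are hypothetical, and the extended coefficient of [27] is not covered.

```python
def matching_coefficient(attrs_u, attrs_v):
    """Number of attributes on which two vertices share the same value."""
    return sum(
        1 for key, value in attrs_u.items()
        if key in attrs_v and attrs_v[key] == value
    )


def weight_edges(edges, attributes):
    """Map each edge (u, v) to the matching coefficient of its endpoints."""
    return {
        (u, v): matching_coefficient(attributes[u], attributes[v])
        for u, v in edges
    }


# Hypothetical toy graph with two nominal attributes per vertex.
attributes = {
    "a": {"dept": "math", "city": "Paris"},
    "b": {"dept": "math", "city": "Lyon"},
    "c": {"dept": "bio", "city": "Lyon"},
}
edges = [("a", "b"), ("b", "c"), ("a", "c")]
weights = weight_edges(edges, attributes)
```

The resulting weighted graph can then be passed to any structural clustering algorithm that accepts edge weights.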


Optimization of Quality Functions. A second type of method aims at finding
an optimal clustering of the vertices by optimizing a quality function over the
partitions (clusters).

A commonly used quality function is modularity [21], which measures the
density differences between vertices within the same cluster and vertices in different
clusters. However, modularity is only based on the structural properties of the
graph. In [6], the authors use entropy as the quality metric to optimize between 
attributes, combined with a modularity-based optimization. Another method, 
recently proposed by Combe et al. [5], groups similar vertices by maximizing 
both modularity and inertia. 

However, these methods suffer from the same drawbacks as other
modularity-optimization-based methods in simple graphs. Indeed, it was shown in
[17] that these methods are biased, and do not always lead to the best clustering.
For instance, such methods fail to detect small clusters in graphs with clusters 
of different sizes. 


Aggregated Distance Measures. Another type of method used to find
communities in vertex-attributed graphs defines an aggregated vertex-vertex
distance that combines the topological distance and the symbolic distance.
All these methods express a distance $d_{v_i, v_j}$ between two vertices $v_i$ and $v_j$ as
$d_{v_i, v_j} = \alpha\, d_T(v_i, v_j) + (1 - \alpha)\, d_S(v_i, v_j)$, where $d_T$ is a structural distance and
$d_S$ is a distance in the attribute space. These structural and attribute distances
represent the two different aspects of the data. They can be chosen
from the vast number of distances available in the literature. For instance, in [4] a
combination of geodesic distance and cosine similarity is used by the authors.
The parameter $\alpha$ controls the importance of each aspect in the overall
similarity for each use case. These methods are appealing because once the
distances between vertices are obtained, many clustering algorithms that cannot
be applied to structures such as graphs can be used to find communities.
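
The aggregated distance can be sketched as follows, assuming (as one possible choice, in the spirit of [4]) a geodesic hop-count distance for $d_T$ and a cosine distance for $d_S$. The helper names and the toy graph are hypothetical, and no rescaling of the two distances is attempted here, although a real use would probably bring them to a comparable range.

```python
import math
from collections import deque


def geodesic_distances(adjacency, source):
    """Shortest-path lengths (in hops) from source, computed by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def cosine_distance(x, y):
    """1 - cosine similarity between two non-zero attribute vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def aggregated_distance(adjacency, features, u, v, alpha=0.5):
    """alpha * d_T(u, v) + (1 - alpha) * d_S(u, v)."""
    d_t = geodesic_distances(adjacency, u).get(v, math.inf)
    d_s = cosine_distance(features[u], features[v])
    return alpha * d_t + (1.0 - alpha) * d_s


# Hypothetical path graph 0 - 1 - 2 with 2-dimensional attributes.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [1.0, 1.0], 2: [0.0, 1.0]}
d_02 = aggregated_distance(adjacency, features, 0, 2, alpha=0.5)
```

Setting `alpha=1.0` recovers the purely structural distance, while values close to 0 let the attributes dominate.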


Miscellaneous. There is yet another family of methods that enable the use of
common clustering methods on attributed graphs. SA-cluster [3,32] is a method
performing the clustering task by adding new vertices. The virtual vertices
represent possible values of the attributes. This approach, although appealing by its
simplicity, has some drawbacks. First, continuous attributes cannot be taken into
account. Second, the complexity can increase rapidly as the number of added
vertices depends on the number of attributes and values for each attribute.
However, the authors proposed an improvement of their method named Inc-Cluster
in [33], where they reduce its complexity.

Some authors have worked on model-based approaches for clustering in
vertex-attributed settings. In [29], the authors proposed a method based on
a Bayesian probabilistic model that is used to perform the clustering of
vertex-attributed graphs, by transforming the clustering problem into a probabilistic
inference problem. Graph embeddings can also be used for this task of
vertex-attributed graph clustering. Examples of these techniques include node2vec [13]
and DeepWalk [23], which aim to efficiently learn a low-dimensional vector
representation of each vertex. Some authors focused on extending vertex embeddings to
vertex-attributed networks [11,14,30].

In this paper, we take a different approach and present a tree-based method 
enabling the computation of vertex-vertex dissimilarities. This method is pre- 
sented in the next section. 


3 Method 


Previous works [7,28] have shown that random partitions of data can be used 
to compute a similarity between the instances. In particular, in Unsupervised 
Extremely Randomized Trees (UET), the idea is that all instances ending up 
in the same leaves are more similar to each other than to other instances. The 
pairwise similarities s(i, j) are obtained by increasing s(i, j) for each leaf where 
both i and j appear. A normalisation is finally performed when all trees have 
been constructed, so that values lie in the interval [0, 1]. Leaves, and, more 
generally, nodes of the trees can be viewed as partitions of the original space. 
Counting the co-occurrences in the leaves is then the same as
counting the co-occurrences of instances in the smallest regions of
a specific partition.
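
The leaf co-occurrence counting described above can be sketched as follows. The representation of a forest as a list of trees, each given by its list of leaves, and the normalisation by the number of trees are assumptions made for this illustration; the tree construction itself is not shown.

```python
def cooccurrence_similarity(forest, n_instances):
    """Pairwise similarity from leaf co-occurrences.

    forest: list of trees, each tree being a list of leaves,
            each leaf a set of instance indices in [0, n_instances).
    """
    s = [[0.0] * n_instances for _ in range(n_instances)]
    for leaves in forest:
        for leaf in leaves:
            # Every pair of instances sharing this leaf gets one more vote.
            for i in leaf:
                for j in leaf:
                    s[i][j] += 1.0
    n_trees = len(forest)
    # Assumed normalisation: divide by the number of trees so values lie in [0, 1].
    return [[value / n_trees for value in row] for row in s]


# Hypothetical forest of two trees over four instances.
forest = [
    [{0, 1}, {2, 3}],
    [{0, 1, 2}, {3}],
]
similarity = cooccurrence_similarity(forest, 4)
```

Instances 0 and 1 share a leaf in both trees and get similarity 1, whereas instances 0 and 3 never do and get similarity 0.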

So far, this type of approach has not been applied to graphs. The intuition
behind our proposed method, GT, is to leverage a similar partition in the
vertices of a graph. Instead of using the similarity computation that we described
previously, we chose to use the mass-based approach introduced by Ting et al.
[28]. The key property of their measure is that the dissimilarity between
two instances in a dense region is higher than the same interpoint dissimilarity
between two instances in a sparse region of the same space. One of the
interesting aspects of this approach is that a dissimilarity is obtained without any
post-processing.

Let $H \in \mathcal{H}(D)$ be a hierarchical partitioning of the original space of a dataset
$D$ into non-overlapping and non-empty regions, and let $R(x, y \mid H)$ be the smallest
local region covering $x$ and $y$ with respect to $H$. The mass-based dissimilarity
$m_e$ estimated by a finite number $t$ of models (here, random trees) is given by
the following equation:

$$m_e(x, y \mid D) = \frac{1}{t} \sum_{i=1}^{t} \tilde{P}(R(x, y \mid H_i)) \qquad (1)$$

where $\tilde{P}(R) = \frac{1}{|D|} \sum_{z \in D} \mathbb{1}(z \in R)$. Figure 1 presents an example of a hierarchical
partition $H$ of a dataset $D$ containing 8 instances. These instances are vertices
in our case. For the sake of the example, let us compute $m_e(1, 4)$ and $m_e(1, 8)$.
We have $m_e(1, 4) = \frac{1}{8}(2) = 0.25$, as the smallest region where instances 1 and 4
co-appear contains 2 instances. However, $m_e(1, 8) = \frac{1}{8}(8) = 1$, since instances 1
and 8 only appear together in one region of size 8, the original space. The same
approach can be applied to graphs.


{1, 2, 3, 4, 5, 6, 7, 8}
{1, 3, 4, 5}   {2, 6, 7, 8}
{1, 4}   {3, 5}   {2}   {6, 7, 8}


Fig. 1. Example of partitioning of 8 instances in non-overlapping non-empty regions 
using a random tree structure. The blue and red circles denote the smallest nodes (i.e., 
regions) containing vertices 1 and 4 and vertices 1 and 8, respectively. (Color figure 
online) 
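
The worked example of Fig. 1 can be checked with a short sketch of Eq. 1. The nested-tuple representation of a tree (a set of members plus a list of child nodes) is an assumption made for this illustration, and the leaf {1, 4} is inferred from the worked example rather than read directly from the figure.

```python
def smallest_region_size(node, x, y):
    """Size of the smallest region of the tree that contains both x and y."""
    members, children = node
    for child in children:
        if x in child[0] and y in child[0]:
            return smallest_region_size(child, x, y)
    return len(members)


def mass_dissimilarity(trees, x, y, n):
    """Eq. 1: mean over the trees of |R(x, y | H_i)| / |D|, with |D| = n."""
    total = sum(smallest_region_size(tree, x, y) for tree in trees)
    return total / (len(trees) * n)


# Hierarchical partition of Fig. 1: (members, [children]).
fig1 = (
    {1, 2, 3, 4, 5, 6, 7, 8},
    [
        ({1, 3, 4, 5}, [({1, 4}, []), ({3, 5}, [])]),
        ({2, 6, 7, 8}, [({2}, []), ({6, 7, 8}, [])]),
    ],
)
m_14 = mass_dissimilarity([fig1], 1, 4, n=8)
m_18 = mass_dissimilarity([fig1], 1, 8, n=8)
```

With this single tree, `m_14` is 0.25 and `m_18` is 1.0, matching the values computed in the text.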


Our proposed method is based on two steps: (i) obtain several partitions of
the vertices using random trees, (ii) use the trees to obtain a relevant dissimilarity
measure between the vertices. Algorithm 1 describes how to build one tree,
describing one possible partition of the vertices. Each tree corresponds to a model
in Eq. 1. Finally, the dissimilarity can be obtained using Eq. 1.

The computation of pairwise vertex-vertex dissimilarities using Graph Trees
and the mass-based dissimilarity we just described has a time complexity of
$O(t \cdot \psi \log \psi + n^2 t \log \psi)$ [28], where $t$ is the number of trees, $\psi$ the maximum
height of the trees, and $n$ the number of vertices. When $\psi \ll n$, this time
complexity becomes $O(n^2)$.

To extend this approach to vertex-attributed graphs, we propose to build a 
forest containing trees obtained by GT over the vertices and trees obtained by 
UET on the vertex attributes. We can then compute the dissimilarity between 
vertices by averaging the dissimilarities obtained by both types of trees. 
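
The averaging step can be sketched as below, assuming that the structural and attribute dissimilarities have already been computed as two square matrices of the same size. The `weight` parameter is a hypothetical generalisation; the default of 0.5 corresponds to a plain average.

```python
def combine_dissimilarities(d_structure, d_attributes, weight=0.5):
    """Weighted average of two square dissimilarity matrices (lists of lists)."""
    n = len(d_structure)
    return [
        [
            weight * d_structure[i][j] + (1.0 - weight) * d_attributes[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical 2 x 2 dissimilarity matrices.
d_gt = [[0.0, 0.2], [0.2, 0.0]]
d_uet = [[0.0, 0.8], [0.8, 0.0]]
d_both = combine_dissimilarities(d_gt, d_uet)
```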

In the next section, we evaluate GT on both real-world and synthetic 
datasets. 


4 Evaluation 


This section is divided into 2 subsections. First, we assess GT's performance
on graphs without vertex attributes (Subsect. 4.1). Then we present
the performance of our proposed method in the case of vertex-attributed graphs
(Subsect. 4.2). An implementation of GT, as well as these benchmarks, is
available at https://github.com/jdalleau/gt.

Algorithm 1. Algorithm describing how to build a random tree partitioning
the vertices of a graph.

    Data: A graph G(V, E), an uninitialized stack S
    root_node = V;  // The root node contains all the vertices of G
    v_s = a vertex sampled without replacement from V;
    V_left = N(v_s) ∪ {v_s};  // N(v) returns the set of neighbours of v
    V_right = V \ V_left;
    Push V_left and V_right to S;
    leaves = [];  // leaves is an empty list
    while S is not empty do
        V_node = pop the last element of S;
        if |V_node| < n_min then
            Append V_node to leaves;  // node size is lower than n_min, it is a leaf node
        else
            v_s = a vertex sampled without replacement from V_node;
            V_left = (V_node ∩ N(v_s)) ∪ {v_s};
            V_right = V_node \ V_left;
            Push V_left to S;
            Push V_right to S;
        end
    end
    return leaves;
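
For readers who prefer code, Algorithm 1 can be sketched in Python as follows. This is an illustrative reading and not the implementation from the repository above: the function name, the adjacency-dictionary input and the default `n_min` are assumptions, and the guard that turns an unsplittable node into a leaf is an addition of this sketch to guarantee termination.

```python
import random


def build_random_tree(adjacency, n_min=2, rng=None):
    """Return the leaves (sets of vertices) of one random partition.

    adjacency maps each vertex to the set of its neighbours.
    """
    rng = rng or random.Random()
    stack = [set(adjacency)]  # the root node contains all the vertices
    leaves = []
    while stack:
        node = stack.pop()
        if len(node) < n_min:
            leaves.append(node)  # node smaller than n_min: it is a leaf
            continue
        pivot = rng.choice(sorted(node))  # sampled vertex v_s
        left = (node & adjacency[pivot]) | {pivot}
        right = node - left
        if not right:
            # Assumption of this sketch (not in Algorithm 1): a node that
            # cannot be split is kept as a leaf, which guarantees termination.
            leaves.append(node)
            continue
        stack.append(left)
        stack.append(right)
    return leaves


# Hypothetical graph: two triangles joined by the edge (2, 3).
adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
leaves = build_random_tree(adjacency, n_min=3, rng=random.Random(0))
```

Like Algorithm 1, the sketch returns only the leaves. Computing the mass-based dissimilarity of Eq. 1 would also require keeping the internal nodes, which is omitted here.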


4.1 Graph Trees on Simple Graphs 


We first evaluate our approach on simple graphs with no attributes, in order to 
assess if our proposed method is able to discriminate clusters in such graphs. 
This evaluation is performed on both synthetic and real-world graphs, presented in
Table 1.


Table 1. Datasets used for the evaluation of clustering on simple graphs using
graph-trees

Dataset       | # vertices | # edges | Average degree | # clusters
Football      | 115        | 1226    | 10.66          | 10
Email-Eu-Core | 1005       | 25571   | 33.24          | 42
Polbooks      | 105        | 441     | 8.40           |
SBM           | 450        | 65994   | 293.307        |

The graphs we call SBM are synthetic graphs generated using stochastic
block models composed of $k$ blocks of a user-defined size, which are connected by
edges according to a user-specified probability. The Football
graph represents a network of American football games during a given season
[12]. The Email-Eu-Core graph [18,31] represents relations between members
of a research institution, where edges represent communication between those
members. We also use a random graph in our first experiment. This graph is
an Erdős–Rényi graph [10] generated with the parameters $n = 300$ and $p = 0.2$.
Finally, the PolBooks data [16] is a graph where vertices represent books about
US politics sold by an online merchant and edges connect books that were frequently
purchased by the same buyers.
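
A minimal generator for such synthetic graphs is sketched below. The block sizes and the probabilities `p_in` and `p_out` are hypothetical values chosen for illustration, since the text does not report the ones used in the experiments. An Erdős–Rényi graph is obtained as the special case of a single block.

```python
import random


def stochastic_block_model(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected graph from a simple stochastic block model.

    Two vertices are linked with probability p_in if they belong to the same
    block and p_out otherwise. Returns (labels, edges).
    """
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return labels, edges


# Hypothetical SBM with three blocks of 150 vertices each.
labels, edges = stochastic_block_model([150, 150, 150], p_in=0.9, p_out=0.1)

# Erdős–Rényi graph with n = 300 and p = 0.2, as a one-block special case.
_, er_edges = stochastic_block_model([300], p_in=0.2, p_out=0.0)
```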

Our first empirical setting aims to compare the differences between the mean 
intracluster and the mean intercluster dissimilarities. These metrics enable a 
comparison that is agnostic to a subsequent clustering method. 

The mean difference is computed as follows. First, the arithmetic mean of
the pairwise dissimilarities between all vertices with the same label is computed,
corresponding to the mean intracluster dissimilarity $\mu_{intra}$. The same process
is performed for vertices with different labels, giving the mean intercluster
dissimilarity $\mu_{inter}$. We finally compute the difference $\Delta = |\mu_{intra} - \mu_{inter}|$. In
our experiments, this difference $\Delta$ is computed 20 times. $\bar{\Delta}$ denotes the mean of
the differences between runs, and $\sigma$ its standard deviation. The results are presented
in Table 2. We observe that in the case of the random graph, $\bar{\Delta}$ is close to 0, unlike
for the graphs where a cluster structure exists. A projection of the vertices based
on their pairwise dissimilarity obtained using GT is presented in Fig. 2.
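
The difference between mean intracluster and mean intercluster dissimilarities for a single run can be sketched as follows. The function name and the toy matrix are hypothetical, and each unordered pair of distinct vertices is counted once.

```python
def mean_cluster_gap(dissimilarity, labels):
    """|mean intracluster dissimilarity - mean intercluster dissimilarity|."""
    intra, inter = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            target = intra if labels[i] == labels[j] else inter
            target.append(dissimilarity[i][j])
    mu_intra = sum(intra) / len(intra)
    mu_inter = sum(inter) / len(inter)
    return abs(mu_intra - mu_inter)


# Hypothetical dissimilarity matrix with two well-separated clusters.
d = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
gap = mean_cluster_gap(d, [0, 0, 1, 1])
```

Repeating this computation over several runs and taking the mean and standard deviation of `gap` gives the kind of summary reported per dataset.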


Table 2. Mean difference between intercluster and intracluster dissimilarities in
different settings.

Dataset      | $\bar{\Delta}$ | $\sigma$
Random graph | 0.0003 | 0.0002
SBM          | 0.29   | 0.005
Football     | 0.25   | 0.002


We then compare the Normalized Mutual Information (NMI) obtained using 
GT with the NMI obtained using two well-known clustering methods on simple 
graphs, namely MCL [8] and Louvain [1]. NMI is a clustering quality metric
that applies when a ground truth is available. Its values lie in the range [0, 1],
with a value of 1 indicating a perfect match between the computed clusters and
the reference ones. The empirical protocol is the following:


1. Compute the dissimilarity matrices using GT, with a total number of trees
$n_{trees} = 200$.

2. Obtain a 2D projection of the points using t-SNE [19] (k = 2). 

3. Apply k-means on the points of the projection and compute the NMI.


Fig. 2. Projection of the vertices obtained using GT on (left) a random graph,
(middle) an SBM-generated graph and (right) the football graph. Each cluster
membership is denoted by a different color. Note how in the case of the random graph,
no clear cluster can be observed. (Color figure online)


We repeated this procedure 20 times and computed means and standard devia- 
tions of the NMI. 
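
For reference, the NMI between a ground-truth labelling and a computed clustering can be sketched as follows. Several normalisations exist; the arithmetic mean of the two entropies used here is an assumption, as the text does not state which variant was used.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI with arithmetic-mean normalisation: 2 * I(U; V) / (H(U) + H(V))."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))
    mutual_info = sum(
        (n_ij / n) * math.log(n * n_ij / (count_true[a] * count_pred[b]))
        for (a, b), n_ij in count_joint.items()
    )
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both labellings consist of a single cluster
    return 2.0 * mutual_info / (h_true + h_pred)


nmi_perfect = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is invariant to a renaming of the clusters, which is why the first example reaches 1 although the label values are swapped.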

The results are presented in Table 3. We compared the mean NMI using the
t-test, and checked that the differences between the obtained values are
statistically significant.

We observe that our approach is competitive with the two well-known
methods we chose in the case of non-attributed graphs on the benchmark datasets.
In one specific case, the graphs generated by the SBM model, we even observe that
Graph Trees significantly outperforms state-of-the-art results. Since the
dissimilarity computation is based on the method proposed by [28] to find
clusters in regions of varying densities, this may indicate that our approach performs
particularly well in the case of clusters of different size.


Table 3. Comparison of NMI on benchmark graph datasets. Best results are in
boldface.

Dataset       | Graph-trees   | Louvain       | MCL
Football      | 0.923 (0.007) | 0.924 (0.000) | 0.879 (0.015)
Email-Eu-Core | 0.649 (0.008) | 0.428 (0.000) | 0.589 (0.012)
Polbooks      | 0.524 (0.012) | 0.521 (0.000) | 0.544 (0.02)
SBM           | 0.998 (0.005) | 0.684 (0.000) | 0.846 (0.000)


4.2 Graph Trees on Attributed Graphs 


Now that we have tested GT on simple graphs, we can assess its performance
on vertex-attributed graphs. The datasets that we used in this subsection are
presented in Table 4.

WebKB represents relations between web pages of four universities, where
each vertex label corresponds to the university and the attributes represent the
words that appear in the page. The Parliament dataset is a graph where the
vertices represent French parliament members, linked by an edge if they cosigned
a bill. The vertex attributes indicate their constituency, and each vertex has a
label that corresponds to their political party.


Table 4. Datasets used for the evaluation of clustering on attributed graphs using GT

Dataset    | # vertices | # edges | # attributes | # clusters
WebKB      | 877        | 1480    | 1703         | 4
Parliament | 451        | 11646   | 108          | 7
HVR        | 307        | 6526    | 6            | 2


The empirical setup is the following. We first compute the vertex-vertex
dissimilarities using GT, and the vertex-vertex dissimilarities using UET. In this
first step, a forest of trees on the structure and a forest of trees on the attributes
of each vertex are constructed. We then compute the average of the two pairwise
dissimilarities. Finally, we apply t-SNE and use the k-means algorithm on the
points in the embedded space. We set k to the number of clusters, since we have
the ground truths. We repeat these steps 20 times and report the means and
standard deviations. During our experiments, we found that preprocessing
the dissimilarities prior to the clustering phase may lead to better results, in
particular with scikit-learn's [22] QuantileTransformer. This transformation tends
to spread out the most frequent values and to reduce the impact of outliers.
In our evaluations, we performed this quantile transformation prior to every
clustering, with $n_{quantile} = 10$.

The NMI values obtained after the clustering step are presented in Table 5.


Table 5. NMI using GT on the structure only, UET on the attributes only and
GT+UET. Best results are indicated in boldface.

Dataset    | GT          | UET         | GT+UET
WebKB      | 0.64 (0.07) | 0.73 (0.08) | 0.98 (0.01)
HVR        | 0.58 (0.06) | 0.58 (0.00) | 0.89 (0.06)
Parliament | 0.65 (0.02) | 0.03 (0.00) | 0.66 (0.02)


We observe that for two datasets, namely WebKB and HVR, considering 
both structural and attribute information leads to a significant improvement in 
NMI. For the other dataset considered in this evaluation, while the attribute
information does not improve the NMI, we observe that it does not decrease it
either. Here, we give the same weight to structural and attribute information.


Fig. 3. Projection of the WebKB data based on the dissimilarities computed (left) 
using GT on structural data, (middle) using UET on the attributes data and (right) 
using the aggregated dissimilarity. Each cluster membership is denoted by a different 
color. (Color figure online) 


In Fig. 3 we present the projection of the WebKB dataset, where we observe 
that the structure and attribute information both bring a different view of the 
data, each with a strong cluster structure. 

HVR and Parliament datasets are extracted from [2]. Using their proposed
approach, the authors obtain an NMI of 0.89 and 0.78, respectively. Although the
NMI values we obtained using our approach are not consistently better in this first
assessment, the method still seems to give similar results without any fine-tuning.


5 Discussion and Future Work 


In this paper, we presented a method based on the construction of random 
trees to compute dissimilarities between graph vertices, called GT. For vertex 
clustering purposes, our proposed approach is plug-and-play, since any clustering 
algorithm that can work on a dissimilarity matrix can then be used. Moreover, 
it could find application beyond graphs, for instance in relational structures in 
general. 

Although the goal of our empirical study was not to show a clear superior- 
ity in terms of clustering but rather to assess the vertex-vertex dissimilarities 
obtained by GT, we showed that our proposed approach is competitive with
well-known clustering methods, Louvain and MCL. We also showed that by computing
forests of graph trees and other trees that specialize in other types of input
data, e.g., feature vectors, it is then possible to compute pairwise dissimilarities
between vertices in attributed graphs.

Some aspects are still to be considered. First, the importance of the vertex 
attributes is dataset dependent and, in some cases, considering the attributes can 
add noise. Moreover, the aggregation method between the graph trees and the 
attribute trees can play an essential role. Indeed, in all our experiments, we gave 
the same importance to the attribute and structural dissimilarities. This choice 
implies that both the graph trees and the attribute trees have the same weight, 
which may not always be the case. Finally, we chose here a specific algorithm to 
compute the dissimilarity in the attribute space, namely, UET. The poor results 
we obtained for some datasets may be caused by some limitations of UET in
these cases. 
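
The effect of the relative weight of the two views can be illustrated with a minimal sketch. It assumes that aggregation is a convex combination of two precomputed dissimilarity matrices; the function name `aggregate_dissimilarities` and the parameter `w_graph` are hypothetical and not taken from the paper, and the actual aggregation used for GT may differ.

```python
def aggregate_dissimilarities(d_graph, d_attr, w_graph=0.5):
    """Convex combination of two square dissimilarity matrices (lists of lists).

    w_graph = 0.5 gives both views the same importance (an assumed reading of
    the equal-weight setting); other values are purely illustrative.
    """
    if not 0.0 <= w_graph <= 1.0:
        raise ValueError("w_graph must lie in [0, 1]")
    n = len(d_graph)
    if len(d_attr) != n or any(len(row) != n for row in d_graph + d_attr):
        raise ValueError("both matrices must be square and of the same size")
    w_attr = 1.0 - w_graph
    return [
        [w_graph * d_graph[i][j] + w_attr * d_attr[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy structural and attribute dissimilarities for three vertices.
d_graph = [[0.0, 0.2, 0.9], [0.2, 0.0, 0.8], [0.9, 0.8, 0.0]]
d_attr = [[0.0, 0.6, 0.5], [0.6, 0.0, 0.4], [0.5, 0.4, 0.0]]

equal = aggregate_dissimilarities(d_graph, d_attr)
structure_heavy = aggregate_dissimilarities(d_graph, d_attr, w_graph=0.9)
```

Lowering the attribute weight is one simple way to limit the noise that uninformative attributes may add; the resulting matrix can be passed to any clustering algorithm that accepts precomputed dissimilarities.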

It should be noted that our empirical results depend on the choice of a 
specific clustering algorithm. Indeed, GT is not a clustering method per se, 
but a method to compute pairwise dissimilarities between vertices. Like other 
dissimilarity-based methods, this is a strength of the method we propose in this 
paper. Indeed, the clustering task can be performed using many algorithms, 
leveraging their respective strengths and weaknesses. 

As future work, we will explore an approach where the choice of whether
to consider the attribute space in the case of vertex-attributed graphs is guided
by the distribution of the variables or the visualization of the embedding. We
also plan to apply our methods to larger graphs than the ones we used in this
paper.

Abstract. We examine deep neural network (DNN) performance and
behavior using contrasting explanations generated from a semantically 
relevant latent space. We develop a semantically relevant latent space by 
training a variational autoencoder (VAE) augmented by a metric learning 
loss on the latent space. The properties of the VAE provide for a smooth 
latent space supported by a simple density and the metric learning term 
organizes the space in a semantically relevant way with respect to the 
target classes. In this space we can both linearly separate the classes 
and generate meaningful interpolation of contrasting data points across 
decision boundaries. This allows us to examine the DNN model beyond 
its performance on a test set for potential biases and its sensitivity to 
perturbations of individual factors disentangled in the latent space. 


Keywords: Deep learning · VAE · Metric learning · Interpretability ·
Explanation


1 Introduction 


Advances in machine learning and deep learning have had a profound impact
on many tasks involving high-dimensional data, such as object recognition and
behavior monitoring. The domain of computer vision in particular has witnessed
great progress in bridging the gap between the capabilities of humans
and machines. This field tries to enable machines to view the world as humans
do, perceive it similarly, and use that knowledge for a multitude of tasks such
as image and video recognition, image analysis and classification, media
recreation and recommender systems. Such models have since been deployed in
high-stakes domains such as COMPAS [8], healthcare [3] and politics [17]. However,
because the inner workings of black-box models are still poorly understood, their
use can lead to dangerous situations [3], such as racial bias [8] and gender
inequality [1].

The need for confidence, certainty, trust and explanations when using supervised
black-box models is substantial in domains with high responsibility. This
paper provides an approach towards better understanding of a model’s predictions
by investigating its behavior on semantically relevant (contrastive) explanations.
To build a semantically relevant latent space we need a smooth space
that corresponds well with the generating factors of the data (i.e. regions
well supported by the associated density should correspond to realistic data points)
and a distance metric that conveys semantic information about the target
task. The vanilla VAE without any extra constraints is insufficient, as it does not
necessarily deliver a distance metric that corresponds to the semantics of the
target class assignment (in our task). Our aim is to develop semantically relevant
decision boundaries in the latent space, which we can use to examine our
target classification model. Therefore, we propose to use a weakly-supervised VAE
that combines metric learning and VAE disentanglement to create a
semantically relevant, smooth and well-separated space. We show that this VAE and
its semantically relevant latent space can be used for various
interpretability/explainability tasks, such as validating predictions made by the
CNN, generating (contrastive) explanations when predictions are odd, and
detecting bias. The approach we propose for these tasks is explained in more
detail using Fig. 1.

Fig. 1. The diagnostics approach to validate and understand the behavior of the CNN.
(1) Extra constraints and loss functions are applied during training of the VAE in order to
create semantically relevant latent spaces. The generative model captures the essential
semantics within the data and is used by (2) a linear Support Vector Machine. The
linear SVM is trained on top of the latent space to classify input on semantics rather
than on the direct mapping from input data X to labels Y. If the SVM and CNN do
not agree on a prediction then (3) we traverse the latent space in order to generate
and capture semantically relevant synthetic images, tested against the CNN, in order
to check what elements have to change in order to change its prediction from a to b,
where a and b are different classes.


In this paper, the key contributions are: (1) An approach that can be used to
validate and check predictions made by a CNN by utilizing a weakly-supervised
generative model that is trained to create semantically relevant latent
spaces. (2) The semantically relevant latent spaces are then used to
train a linear support vector machine to capture decision rules that define a
class assignment. The SVM is then used to check predictions based on semantics
rather than the direct mapping of the CNN. (3) If there is a misalignment in
the predictions (i.e. the CNN and SVM do not agree), we posit the top k
best candidates (classes) and for these candidates traverse the latent space in
order to generate semantically relevant (contrastive) explanations by utilizing
the decision boundaries of the SVM.

In summary, this paper proposes a method that allows for the validation of
CNN performance by comparing it against a linear classifier that is based
on semantics, and provides a framework that generates explanations when the
classifiers do not agree. The explanations are provided qualitatively to an expert
within the field. Each explanation encompasses the original image, reconstructed
images and the path towards its most probable answers. Additionally, it shows
the minimal difference that makes the classifiers change their prediction to one of
the most probable answers. The expert can then check these results to make a
quick assessment of which class the image actually belongs to. Finally, the
framework provides the ability to further investigate the model mathematically
using the linear classifier as a proxy model.


2 Related Work 


Interest in interpretability and explainability studies has grown significantly
since the inception of the “Right to Explanation” [20] and ethicality studies into
the behavior of machine learning models [1,3,8,17]. As a result, developers of AI
are encouraged and required, amongst others, to create algorithms that are
transparent, non-discriminatory, robust and safe. Interpretability is most commonly
used as an umbrella term and stands for providing insight into the behavior and
thought processes behind machine-learning algorithms. Many other terms
are used for this phenomenon, such as interpretable AI, explainable machine
learning, causality, safe AI, computational social science, etc. [5]. We position our
research as an interpretability study, but this does not necessarily mean that other
interpretability studies are closely related to this work.

There have been many approaches that all work towards the goal of understanding
black-box models. Linear proxy models, such as LIME [18], locally approximate
complex models using linear fits. Decision trees and rule
extraction methods, such as DeepRED [21], are also considered highly explainable,
but quickly become intractable as complexity increases. Saliency mapping
[19] provides visual information as to which part of an image is most likely
used in a prediction; however, it has been demonstrated to be unreliable if not
strongly conditioned [10]. Another approach to interpretability is
explaining the role of each part within a black-box model, such as the role of
a layer or individual neurons [2] or representation vectors within the activation
space [9].

Most of the approaches stated above assume that there has to be a trade-off
between model performance and explainability. Additionally, as the current
interpretability methods for black-box models are still insufficient and approximate,
they can cause more harm than good when communicated as methods that
solve all problems. Many interpretability methods do not take into account
the actual needs of stakeholders [13], or fail to take into account the
vast research into explanations and interpretability in the fields of psychology [14]
and the social sciences [15]. The “Explanation in Artificial Intelligence” study by
Miller [15] describes the current state of interpretable and explainable algorithms,
how most of the techniques currently fail to capture the essence of an
explanation, and how to improve: an interpretability or explainability method
should at least include, but is not limited to, a non-disputable textual, mathematical
and/or visual explanation that is selective, social and, depending
on the proof, contrastive.

For this reason, our approach focuses on providing selective (contrastive)
explanations that combine visual aspects with the ability to further investigate
the model mathematically using a proxy model that does not impact the
CNN directly. Usually, generative models such as Variational Autoencoders
(VAEs) [11] and Generative Adversarial Networks (GANs) are unsupervised and
used to sample and generate images from a latent space obtained by
training the generative network. However, we propose to use a weakly-supervised
generative network in order to impose (discriminative) structure, in addition to
variational inference, on the latent space of said model using metric learning [6].

This approach is therefore most related to the interpretability
area of sub-sampling proxy generative models to answer questions about a
discriminative black-box model. The two closest studies that attempt similar
research are a preprint of CDeepEx [4] by Amir Feghahati et al. and xGEMs [7]
by Joshi et al. Both CDeepEx and xGEMs propose the use of a proxy generative
model to explain the behavior of a black-box model, primarily using
generative adversarial networks (GANs). The xGEMs paper presents a framework
to characterize and explain binary classification models by generating
manifold-guided examples using a generative model. The behavior of the black-box
model is summarized by quantitatively perturbing data samples along the
manifold. xGEMs also detects and quantifies bias during model training to
understand how bias affects black-box models. The xGEMs approach is similar
to ours in that it uses a generative model to explain a black-box
model. Similarly, the CDeepEx paper positions its work as generating contrastive
explanations using a proxy generative model. The generated explanations focus
on answering the question “why a and not b?” with GANs, where a is the class
of an input example I and b is a chosen class against which the differences are captured.

However, neither of these papers addresses the fact that, in a multi-class
(discriminative) classification problem, unexpected behavior can occur if the
generative model’s latent space is not smooth, well separated and semantically relevant.
For instance, when traversing the latent space it is possible to pass from a
through any number of classes before reaching class b because the space is not well
separated and smooth. This will create ineffective explanations since, depending on
how the explanations are generated, they will give information on ‘why class a and not b
using properties of c’. An exact geodesic path along the manifold would require
great effort, especially in high dimensions. Our approach also differs in
that we utilize a weakly-supervised generative model as well as an extra
linear classifier on top of the latent space to provide us with extra information
on the data and the latent space. Some approaches we take, however, are very
similar, such as using a generative model as a proxy to explain a black-box model
as well as sub-sampling the latent space to probe the behavior of a black-box
model and generate explanations using the predictions.


3 Methodology 


This paper posits its methodology as a way to explain and validate decisions
made by a CNN. The predictions made by the CNN are validated and explained
utilizing the properties of a weakly-supervised proxy generative model, more
specifically, a triplet-VAE. There are three main factors that contribute to the
validation and explanation of the CNN. First, a triplet-VAE is trained in order
to provide a semantically relevant and well-separated latent space. Second, this
latent space is used to train an interpretable linear support vector machine,
which is used to validate decisions of the CNN by comparison. Third, when a
CNN decision is misaligned with the decision boundaries in the latent space, we
generate explanations by stating the K most probable answers and
providing a qualitative explanation to validate these top K answers.
Each of these factors corresponds to a number in Fig. 1 and
to a section: (1) triplet-VAE, Sect. 3.1; (2) CNN decision validation,
Sect. 3.2; (3) generating (contrastive) explanations, Sect. 3.3.


3.1 Semantically Relevant Latent Space 


Typically, a triplet network consists of three instances of a neural network that
share parameters. These three instances are separately fed different types of
input: an anchor, a positive sample and a negative sample. These are then used to
learn useful representations by distance comparisons. We propose to incorporate
this notion of a triplet network to semantically structure and separate the latent
space of the VAE using the available input and labels. A triplet VAE consists
of three instances of the encoder with shared parameters that are each fed
precomputed triplets: an anchor, a positive sample and a negative sample, $x_a$, $x_p$ and
$x_n$. The anchor $x_a$ and positive sample $x_p$ are of the same class but not the same
image, whereas the negative sample $x_n$ is from a different class. In each iteration
of training, the input triplet is fed to the encoder network to get the mean
latent embeddings $F(x_a)^\mu = z_a^\mu$, $F(x_p)^\mu = z_p^\mu$ and $F(x_n)^\mu = z_n^\mu$. These are then
used to compute a similarity loss function that induces a loss when a negative
sample $z_n^\mu$ is closer to $z_a^\mu$ than $z_p^\mu$ is, distance-wise, i.e. with
$\delta_{ap}(z_a^\mu, z_p^\mu) = \|z_a^\mu - z_p^\mu\|$ and $\delta_{an}(z_a^\mu, z_n^\mu) = \|z_a^\mu - z_n^\mu\|$.
This provides us with three possible situations:
$\delta_{ap} > \delta_{an}$, $\delta_{ap} < \delta_{an}$ and $\delta_{ap} = \delta_{an}$ [6].

Fig. 2. Given an input image I we check the prediction of the CNN as well as the SVM.
If both classifiers predict the same class, we return the predicted class. In contrast, if
the classifiers do not predict the same class, we propose to return the top k most
probable answers as well as an explanation why those classes are the most probable.

We wish to find an embedding where samples of a certain class lie close to
each other in the latent space of the VAE. For this reason, we wish to add loss to
the algorithm when we arrive in the situation where $\delta_{ap} > \delta_{an}$. In other words,
we wish to push $x_n$ further away, such that we ultimately arrive in the situation
where $\delta_{ap} < \delta_{an}$, or $\delta_{ap} = \delta_{an}$ with some margin $\epsilon$. As such we arrive at the triplet
loss function that we will use in addition to the KL divergence and reconstruction
loss within the VAE:
$L(z_a^\mu, z_p^\mu, z_n^\mu) = \alpha \max\{\|z_a^\mu - z_p^\mu\| - \|z_a^\mu - z_n^\mu\| + \epsilon, 0\}$,
where $\epsilon$ provides leeway when $\delta_{ap} = \delta_{an}$ and pushes the negative sample away
even when the distances are equal.
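
To make the three situations concrete, the triplet term can be sketched in a few lines. This is a minimal illustration in the standard hinge form (a plain maximum with zero), operating on latent means given as Python lists; the function names and the default values of `margin` and `alpha` are assumptions for the example, not settings from the experiments.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(z_a, z_p, z_n, margin=1.0, alpha=1.0):
    """Hinge-style triplet loss on anchor, positive and negative latent means."""
    d_ap = euclidean(z_a, z_p)  # anchor to positive distance
    d_an = euclidean(z_a, z_n)  # anchor to negative distance
    return alpha * max(d_ap - d_an + margin, 0.0)


anchor = [0.0, 0.0]
positive = [1.0, 0.0]

# Negative far away (d_ap < d_an by more than the margin): no loss.
loss_far = triplet_loss(anchor, positive, [5.0, 0.0])
# Negative closer than the positive (d_ap > d_an): loss is induced.
loss_near = triplet_loss(anchor, positive, [0.5, 0.0])
# Equal distances (d_ap = d_an): the margin still pushes the negative away.
loss_equal = triplet_loss(anchor, positive, [0.0, 1.0])
```

With these toy points the far negative contributes nothing, while the near and equidistant negatives both yield a positive loss, which is the behavior the margin is meant to enforce.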

We have an already trained CNN which we would like to validate; it was
trained on input data $X : x_1 \ldots x_n$ and labels $Y : y_1 \ldots y_n$, where each $y_i$ states
the true class of $x_i$. We then use the same $X$ and $Y$ to train the triplet-VAE.
(1) First, we compute triplets of the form $(x_a, x_p, x_n)$ from the input data $X$ and
labels $Y$, which are then used to train the triplet-VAE. A typical VAE consists
of an encoder $F(x) = \mathrm{Encoder}(x) \sim Q(z|x)$, which compresses the data into a latent
space $Z$, a decoder $G(z) = \mathrm{Decoder}(z) \sim P(x|z)$, which reconstructs the data given the
latent space $Z$, and a prior $P(z)$, in our case a Gaussian $\mathcal{N}(0, 1)$, imposed on
the model. In order for the VAE to learn a latent space similar to its prior
and be able to reconstruct images, it is trained by minimizing the negative Evidence
Lower Bound (ELBO):

$$-\mathrm{ELBO} = -\mathbb{E}_{z \sim Q(z|x)}[\log P(x|z)] + \mathrm{KL}[Q(z|x) \,\|\, P(z)]$$

This can be explained as the reconstruction loss, or expected negative log-likelihood,
$-\mathbb{E}_{z \sim Q(z|x)}[\log P(x|z)]$ and the KL divergence loss $\mathrm{KL}[Q(z|x) \,\|\, P(z)]$, to
which we add the triplet loss:


$$L(z_a^\mu, z_p^\mu, z_n^\mu) = \alpha \max\{\|z_a^\mu - z_p^\mu\| - \|z_a^\mu - z_n^\mu\| + \epsilon, 0\}$$


This compound loss semi-forces the latent space of the VAE to be well separated
due to the triplet loss and disentangled due to the KL divergence loss combined with
the scalar $\beta$, and provides a means of (reasonably) reconstructing images through the
reconstruction loss. This results in the following loss function for training
the VAE:


$$\mathrm{loss} = -\mathbb{E}_{z \sim Q(z|x)}[\log P(x|z)] + \beta \, \mathrm{KL}[Q(z|x) \,\|\, P(z)] + L(z_a^\mu, z_p^\mu, z_n^\mu)$$
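
The way the three terms combine can be sketched as follows. The sketch makes two assumptions that are not stated in the text: the approximate posterior is a diagonal Gaussian, so that the KL term to a standard normal prior has its well-known closed form, and the reconstruction term is a squared error (a Gaussian negative log-likelihood up to constants). All function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def reconstruction_loss(x, x_hat):
    """Squared error, used here as a stand-in for the negative log-likelihood."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def compound_loss(x, x_hat, mu, log_var, triplet_term, beta=1.0):
    """Reconstruction + beta * KL + triplet term, mirroring the three-part loss."""
    return (
        reconstruction_loss(x, x_hat)
        + beta * kl_to_standard_normal(mu, log_var)
        + triplet_term
    )


# Toy values: the posterior equals the prior, so the KL term vanishes.
total = compound_loss(
    x=[1.0, 0.0], x_hat=[0.5, 0.0],
    mu=[0.0, 0.0], log_var=[0.0, 0.0],
    triplet_term=1.5, beta=4.0,
)
```

In a real training loop the same sum would be computed on mini-batches by an automatic-differentiation framework; the scalar `beta` then trades reconstruction quality against closeness to the prior.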


3.2 Decision Validation


Given a semantically relevant latent space, we can use it for steps two
and three as indicated in Fig. 1. (2) In the second step, CNN Decision Validation, we
train an additional classifier on top of the triplet-VAE latent space, specifically
$z^\mu$. We train the linear Support Vector Machine using $Z^\mu$ as input data and $Y$
as labels, where $[Z^\mu, Z^\sigma] = F(X)$. The goal of the linear support vector machine
is two-fold. It provides a means of validating each prediction made by the CNN
by using the encoder and the linear classifier, i.e. given an input example $I$, we
have $C(I) = \hat{y}_{C(I)}$ and $S(F(I)^\mu) = \hat{y}_{S(I)}$, and compare them against each other:
$\hat{y}_{C(I)} = \hat{y}_{S(I)}$. And, as the linear classifier is a simpler model than the highly
complex CNN, it will function as the ground-truth base for the predictions that
are made. As such, we arrive at two possible cases:


$$\mathrm{Comparison}(I) = \begin{cases} \text{Positive} & \text{if } \hat{y}_{C(I)} = \hat{y}_{S(I)} \\ \text{Negative} & \text{if } \hat{y}_{C(I)} \neq \hat{y}_{S(I)} \end{cases} \qquad (1)$$

First, if both classifiers agree then we arrive at an optimal state, meaning
that the prediction is based on both semantics and the direct mapping found by the
CNN. In this way, we can say with high confidence that the prediction is correct.
In the second case, if the classifiers do not agree, three cases can occur: the SVM
is correct and the CNN is incorrect, the SVM is incorrect and the CNN is correct,
or both the SVM and the CNN are incorrect. In each of these cases we can suggest
a most probable answer as well as a selective (contrastive) explanation, indicated
as step 3 of the framework and explained in Fig. 2.
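
The validation step amounts to a label comparison per input. A minimal sketch, assuming the CNN and SVM predictions are already available as lists of class labels (the helper names are hypothetical):

```python
def comparison(cnn_label, svm_label):
    """Return 'Positive' when both classifiers agree, 'Negative' otherwise."""
    return "Positive" if cnn_label == svm_label else "Negative"


def flag_for_explanation(cnn_labels, svm_labels):
    """Indices of inputs on which the CNN and the SVM disagree."""
    return [
        i
        for i, (c, s) in enumerate(zip(cnn_labels, svm_labels))
        if comparison(c, s) == "Negative"
    ]


# Toy predictions for five inputs.
cnn_labels = [3, 1, 7, 6, 0]
svm_labels = [3, 1, 1, 8, 0]
to_explain = flag_for_explanation(cnn_labels, svm_labels)
```

Only the flagged inputs are passed on to the explanation step; agreement alone does not guarantee correctness, since both classifiers can agree on a wrong label.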


3.3 Generating (contrastive) Explanations 


An explanation consists of (1) the most probable answers and (2) a qualitative
investigation of latent traversal towards the most probable answers. The most
probable answers are obtained by applying the averaged sum rule [12] to the predicted
probabilities per class of both the CNN and the SVM and selecting the top K
answers, where K can be appropriately selected. An SVM
does not natively return a probabilistic answer; however, following Platt’s method [16] we
apply an additional sigmoid function to map the SVM outputs into probabilities.
These top K answers are then used to present and generate selected
contrastive explanations.
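
The selection of the top K answers can be sketched as below. The sigmoid follows the usual Platt form $1 / (1 + \exp(A f + B))$; the parameters `a` and `b` would normally be fitted on held-out data and are fixed here purely for illustration. Normalizing the per-class sigmoid outputs so that they sum to one is an additional assumption of this sketch, as are all function names.

```python
import math


def platt_probability(score, a=-1.0, b=0.0):
    """Map an SVM decision value to a probability with a sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def svm_probabilities(scores, a=-1.0, b=0.0):
    """Per-class sigmoid outputs, renormalized to sum to one (an assumption)."""
    raw = [platt_probability(s, a, b) for s in scores]
    total = sum(raw)
    return [p / total for p in raw]


def top_k_by_sum_rule(cnn_probs, svm_probs, k=3):
    """Average both classifiers' class probabilities and keep the k largest."""
    averaged = [(c + s) / 2.0 for c, s in zip(cnn_probs, svm_probs)]
    ranked = sorted(range(len(averaged)), key=lambda i: averaged[i], reverse=True)
    return [(i, averaged[i]) for i in ranked[:k]]


# Toy three-class example: the CNN prefers class 1, the SVM prefers class 0.
cnn_probs = [0.1, 0.7, 0.2]
svm_probs = svm_probabilities([2.0, -1.0, 0.0])
top_two = top_k_by_sum_rule(cnn_probs, svm_probs, k=2)
```

Averaging keeps a class in the candidate list when either classifier assigns it substantial probability, which is what makes the disagreement cases informative.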

The top K predictions or classes will be used to traverse and sub-sample
the latent space from the initial representation, or location $Z_I^\mu$, towards
another class. We can find a path by finding the closest point within the latent
space such that the decision boundary is crossed and the SVM predicts the target
class. Alternatively, we could use the closest data point in the latent space
that belongs to the training set, $\arg\min_{x_i \in X} \|F(x_i)^\mu - Z_I^\mu\|$. Traversing
and sub-sampling the latent space will change the semantics minimally to change
the class prediction. We capture the minimal change needed to change
both the SVM and CNN prediction to the target class. This information is then
presented to the domain expert for verification and takes the following form:
the most probable answer is a because the input image I is semantically
closest to the following features, where the features are presented qualitatively.
The explanations are generated as shown in Fig. 3.

Fig. 3. Generating (contrastive) explanations consists of several steps. First, we are
given an input image $I$ in question and the top $K$ most probable answers. $K$ denotes training
data $X$ for class $k$ labeled with $y = k$. We feed both $I$ and $K$ through the encoder
$F(X)$ to receive their respective semantic locations in the latent space. We then find
the closest training point that belongs to the target class $k$ and find the vector $v$ in the
direction of that point. Afterwards, we uniformly sample $\epsilon$ data points along this vector $v$,
denoted as $Z_j^\mu$, where $j$ iterates over $0 \ldots j \ldots \epsilon$. $Z_j^\mu$ is then checked
against the SVM and used to generate images $X_{Z_j^\mu}$ using the decoder $G(Z_j^\mu)$. The
generated images are then fed to the CNN to make a prediction and, as the images will
semantically change along the vector, the prediction will change as well. Afterwards, we
can compare the predictions from both the CNN and SVM. Subsequently, we use the
first moment where both predictions are equal to target class $k$, denoted as moment $l$,
for generating an explanation: the minimal semantic difference necessary to be equal to
the target class, $\Delta U_l$.

The decision boundaries around the clusters within the latent space are fitted
by the SVM and can be used to answer questions of the form ‘why a and not
b?’. If $\hat{y}_{C(I)}$ and $\hat{y}_{S(I)}$ do not predict the same class, then we assume that $\hat{y}_{S(I)}$
is correct. We then find a path, indicated by $v$, from $\hat{y}_{S(I)}$ to $\hat{y}_{C(I)}$, that is, from $Z_I^\mu$
to the target class. This can be done by calculating a vector orthogonal to the
hyperplane fitted by the SVM towards the target class. Alternatively, we can
find the closest $z^\mu \in Z^\mu$ that satisfies $\hat{y}_{S(z^\mu)} = \hat{y}_{C(z^\mu)}$ and is not the same as
the initial prediction $\hat{y}_{C(I)}$. This means that $v$ is the vector from $I$ to the closest
data point of the target class, with respect to Euclidean distance.

We then uniformly sample points along vector $v$ and check them against the
SVM as well as the CNN. The sampled points can directly be fed to the SVM to
get a prediction $\hat{y}(S(v_j))$ for every $v_j \in V$. Similarly, we can get predictions of the
CNN by transforming the sampled points into images using the decoder $G$. The images are then fed to
the CNN to get a prediction $\hat{y}(C(G(v_j)))$ for every $v_j \in V$. The predictions of both
classifiers will change as the images start looking more and more like the target
class as generative factors change along the vector. If we capture the changes that
make the prediction change, we can show the minimal difference required
to change the prediction of the CNN. In this way we can generate contrastive
examples: for the top ‘close’ class that is not $y_I$ we answer the question ‘why
$y_I$ and not the other semantically close class?’. Hence, we find the answer to the
question “why a and not b?”, as the answer is the shortest approximate change
between the two classes that makes the CNN change its prediction. As a result,
we have found a way to validate the inner workings of the CNN. If there are
doubts about a prediction, it can be investigated and checked.
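
The traversal can be illustrated with a small, self-contained sketch. It assumes a straight-line path between the latent location of the input and the closest point of the target class, and it replaces the trained decoder, SVM and CNN with toy stand-in functions; every name below is hypothetical and none of it is the implementation used in the experiments.

```python
def sample_along_vector(z_start, z_target, steps):
    """Return steps + 1 evenly spaced latent points from z_start to z_target."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        [s + (t - s) * j / steps for s, t in zip(z_start, z_target)]
        for j in range(steps + 1)
    ]


def first_agreement(points, svm_predict, cnn_predict, decode, target):
    """Index of the first sampled point where both classifiers predict target."""
    for j, z in enumerate(points):
        if svm_predict(z) == target and cnn_predict(decode(z)) == target:
            return j
    return None  # the classifiers never agree on the target along this path


# Toy stand-ins: an identity decoder and two threshold classifiers.
def toy_decode(z):
    return z


def toy_svm(z):
    return "b" if z[0] > 0.5 else "a"


def toy_cnn(x):
    return "b" if x[0] > 0.75 else "a"


path = sample_along_vector([0.0, 0.0], [1.0, 0.0], steps=10)
moment = first_agreement(path, toy_svm, toy_cnn, toy_decode, target="b")
```

In this toy setting the SVM switches to class b earlier than the CNN, so the first moment of agreement is determined by the CNN; the difference between the decoded images at index 0 and at `moment` is what would be shown to the expert.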


4 Results 


In this paper we show experimental results on MNIST by generating (contrastive)
explanations to provide extra information on predictions made by the
CNN and to evaluate its performance. The creation of these explanations requires
a semantically relevant and well-separated latent space. Therefore, we first show
the difference between the latent space of the vanilla VAE and the triplet-VAE
and its effect on training a linear classifier on top of the latent space.
Figures 4 and 5 show a t-SNE visualization of the separation of classes within the
latent space. Not surprisingly, the triplet-VAE separates the data in a far more
semantically relevant way, and this is also reflected in the accuracy
of a linear model trained on the data.


Fig. 4. Visualization of a two-dimensional latent space of a vanilla VAE on MNIST

Fig. 5. Visualization of a two-dimensional latent space of a T-VAE on MNIST


154 J. van Doorenmalen and V. Menkovski 


Table 1. This table shows the percentages of agreement with respect
to all possible cases.

Case                            Fraction
(1) $Y_S = Y_C = Y$             0.3986
(2) $Y_S = Y_C \neq Y$          0.003
(3) $(Y_S = Y) \neq Y_C$        0.0086
(4) $Y_S \neq (Y_C = Y)$        0.0314
(5) $Y_S \neq Y_C \neq Y$       0.0044

Second, the percentages in Table 1 show how much both classifiers agree,
broken down per possible case. Not surprisingly, case four happens
more often than case three, which can mean two things: our latent space is too simple
to capture the full complexity of the class assignment, and the CNN is not constrained
by extra loss functions. However, in three of the four remaining cases, those where
$Y_S \neq Y_C$, we can explain
the most probable predictions and provide a generated (contrastive) explanation.
The only case we cannot check or know about is case two, where both $Y_S$ and
$Y_C$ predict the same class but are wrong. The only way to capture this behavior
is to generate explanations for every single decision.
Nevertheless, as an example for generating explanations we use example 6783
(case 5), as shown in Fig. 6.
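
The five cases follow directly from comparing the true label with the two predictions. A minimal sketch of this bookkeeping, with hypothetical function names and toy labels rather than the MNIST results:

```python
def agreement_case(y_true, y_svm, y_cnn):
    """Map one sample to case 1-5 of the agreement table."""
    if y_svm == y_cnn:
        return 1 if y_svm == y_true else 2  # both right / both wrong together
    if y_svm == y_true:
        return 3  # only the SVM is correct
    if y_cnn == y_true:
        return 4  # only the CNN is correct
    return 5  # the classifiers disagree and both are wrong


def case_fractions(y_true, y_svm, y_cnn):
    """Fraction of samples falling into each of the five cases."""
    counts = {case: 0 for case in range(1, 6)}
    for t, s, c in zip(y_true, y_svm, y_cnn):
        counts[agreement_case(t, s, c)] += 1
    n = len(y_true)
    return {case: counts[case] / n for case in counts}


# Toy labels: one sample per case.
y_true = [1, 1, 1, 1, 1]
y_svm = [1, 6, 1, 6, 6]
y_cnn = [1, 6, 8, 1, 8]
fractions = case_fractions(y_true, y_svm, y_cnn)
```

Cases 3 to 5 are exactly those that the disagreement check can detect; case 2 remains invisible unless an explanation is generated for every prediction.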

Generating explanations consists of three parts. First, we propose the
top K probable answers: for this example the true label is 1, and the most
probable answers are 6, 8 and then 1, with averaged probabilities 0.512332,
0.3382 and 0.1150. Second, for those most probable target classes (6, 8 and 1)
we traverse the latent space from the initial location $Z_I^\mu$ to the closest point
of that class that is predicted correctly, i.e. on which the SVM and
CNN agree; the corresponding vector is denoted as $v \in V$. Figure 7 shows the
generated images from the uniformly sampled data points along vectors $v_k \in V$,
where $k \in K$ stands for 6, 8 and 1 in this case.
The figures show which changes happen when traversing the latent space and at
which points both the SVM and the CNN agree with respect to their decision.
For the traversal from $Z_I^\mu$ to class 6 it can be seen that both
classifiers agree rather quickly and only minimal changes are required to change the predictions.
Third, for such an occurrence we can further zoom in on what is happening
and what really makes that the most probable answer. Figure 6 shows the
minimal changes required to change the prediction as well as the transformed
image on which the classifiers agree. The first row shows the original image,
positive changes, negative changes and the changes combined. The second row
shows the reconstructed image and the reconstructed images with the positive
changes, negative changes, and positive and negative changes, respectively. In this
way, for each probable answer it shows its closest representative and the changes
required to be part of that class.

Fig. 6. Once the SVM and the CNN both predict the target class we capture the
minimal changes that are necessary to change their predictions

Fig. 7. Per top k probable answers we traverse and sample the latent space to generate
images that can be used to test the behavior of the CNN. The red line indicates the
moment where both the SVM and the CNN predict the target class (Color figure online)


5 Conclusion


This paper examines the behaviour and performance of a deep neural network by
utilizing a weakly-supervised generative model as a proxy. The weakly-supervised
generative model aims to uncover the generative factors underlying the data
and separate abstract classes by applying metric learning. The proxy’s goal is
three-fold: the semantically meaningful space will be the base for a linear support
vector machine; the model’s generative capabilities will be used to generate
images that can be probed against the black box in question; and the latent space
is traversed and sampled from an anchor I to another class k in order to find
the minimal important difference that changes both classifiers’ predictions. The
goal of the framework is to increase confidence in the predictions made by the black box by
better understanding the behaviour of the CNN by simulating questions of the
form ‘Why a and not b?’, where a and b are different classes.

We examine deep neural network (DNN) performance and behaviour using 
contrasting explanations generated from a semantically relevant latent space. 
The results show that each of the above goals can be achieved and the framework
performs as expected. We develop a semantically relevant latent space by
training a variational autoencoder (VAE) augmented by a metric learning loss
on the latent space. The properties of the VAE provide for a smooth latent space 
supported by a simple density and the metric learning term organizes the space 
in a semantically relevant way with respect to the target classes. In this space we 
can both linearly separate the classes and generate relevant interpolation of con- 
trasting data points across decision boundaries and find the minimal important 
difference that changes the classifier’s predictions. This allows us to examine the 
DNN model beyond its performance on a test set for potential biases and its 
sensitivity to perturbations of individual factors in the latent space. 


Abstract. We introduce geometric pattern mining, the problem of find-
ing recurring local structure in discrete, geometric matrices. It differs 
from existing pattern mining problems by identifying complex spatial 
relations between elements, resulting in arbitrarily shaped patterns. 
After we formalise this new type of pattern mining, we propose an 
approach to selecting a set of patterns using the Minimum Description 
Length principle. We demonstrate the potential of our approach by intro- 
ducing Vouw, a heuristic algorithm for mining exact geometric patterns. 
We show that Vouw delivers high-quality results with a synthetic bench- 
mark. 


1 Introduction 


Frequent pattern mining [1] is the well-known subfield of data mining that aims 
to find and extract recurring substructures from data, as a form of knowledge 
discovery. The generic concept of pattern mining has been instantiated for many 
different types of patterns, e.g., for item sets (in Boolean transaction data) and 
subgraphs (in graphs/networks). Little research, however, has been done on pat- 
tern mining for raster-based data, i.e., geometric matrices in which the row and 
column orders are fixed. The exception is geometric tiling [4,11], but that prob- 
lem only considers tiles, i.e., rectangular-shaped patterns, in Boolean data. 

In this paper we generalise this setting in two important ways. First, we 
consider geometric patterns of any shape that are geometrically connected, i.e., 
it must be possible to reach any element from any other element in a pattern by 
only traversing elements in that pattern. Second, we consider discrete geometric 
data with any number of possible values (which includes the Boolean case). We 
call the resulting problem geometric pattern mining. 

Figure 1 illustrates an example of geometric pattern mining. Figure 1a shows 
a 32 × 24 grayscale ‘geometric matrix’, with each element in [0,255], apparently 
filled with noise. If we take a closer look at all horizontal pairs of elements, 
however, we find that the pair (146,11) is, amongst others, more prevalent than 
expected from ‘random noise’ (Fig. 1b). If we continued to try all combinations 
of elements that ‘stand out’ from the background noise, we would eventually 
find four copies of the letter ‘I’ set in 16 point Garamond Italic (Fig. 1c). 


© The Author(s) 2020 
(a) 32 × 24 ‘geometric matrix’. (b) Pair (146, 11). (c) Pattern ‘I’ occurs four times. 


Fig. 1. Geometric pattern mining example. Each element is in [0, 255]. 


The 35 elements that make up a single ‘I’ in the example form what we call 
a geometric pattern. Since its four occurrences jointly cover a substantial part 
of the matrix, we could use this pattern to describe the matrix more succinctly 
than by 768 independent values. That is, we could describe it as the pattern ‘I’ 
at locations (5,4), (11,11), (20,3), (25,10) plus 628 independent values, thereby 
separating structure from accidental (noise) data. Since the latter description 
is shorter, we have compressed the data. At the same time we have learned 
something about the data, namely that it contains four I’s. This suggests that 
we can use compression as a criterion to find patterns that describe the data. 
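
The arithmetic behind this example can be checked with a short Python sketch. It simply counts described items (one per stored value or location), which is a simplification assumed here for illustration; the encoding developed later in the paper measures lengths in bits. The variable names are illustrative.

```python
# Numbers taken from the example: a 32 x 24 matrix and a 35-element pattern
# that occurs four times.
matrix_elements = 32 * 24
pattern_elements = 35
occurrences = 4

# Elements not covered by any occurrence remain independent values.
independent_values = matrix_elements - occurrences * pattern_elements

# Crude item count (assumption): the pattern is stored once, each occurrence
# costs one location, and every uncovered element costs one value.
naive_items = matrix_elements
pattern_items = pattern_elements + occurrences + independent_values

print(naive_items, independent_values, pattern_items)  # 768 628 667
```

Even under this crude count, the pattern-based description needs fewer items than listing all elements independently.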


Approach and Contributions. Our first contribution is that we introduce and 
formally define geometric pattern mining, i.e., the problem of finding recurring 
local structure in geometric, discrete matrices. Although we restrict the scope 
of this paper to two-dimensional data, the generic concept applies to higher 
dimensions. Potential applications include the analysis of satellite imagery, tex- 
ture recognition, and (pattern-based) clustering. 

We distinguish three types of geometric patterns: (1) exact patterns, which 
must appear exactly identical in the data to match; (2) fault-tolerant patterns, 
which may have noisy occurrences and are therefore better suited to noisy data; 
and (3) transformation-equivalent patterns, which are identical after some trans- 
formation (such as mirror, inverse, rotate, etc.). Each consecutive type makes 
the problem more expressive and hence more complex. In this initial paper we 
therefore restrict the scope to the first, exact type. 

As many geometric patterns can be found in a typical matrix, it is crucial to 
find a compact set of patterns that together describe the structure in the data 
well. We regard this as a model selection problem, where a model is defined by 
a set of patterns. Following our observation above, that geometric patterns can 
be used to compress the data, our second contribution is the formalisation of 
the model selection problem by using the Minimum Description Length (MDL) 
principle [5,8]. Central to MDL is the notion that ‘learning’ can be thought 
of as ‘finding regularity’ and that regularity itself is a property of data that 
is exploited by compressing said data. This matches very well with the goals of 
pattern mining, as a result of which the MDL principle has proven very successful 
in this area [7,12]. 
Finally, our third contribution is Vouw, a heuristic algorithm for MDL-based 
geometric pattern mining that (1) finds compact yet descriptive sets of patterns, 
(2) requires no parameters, and (3) is tolerant to noise in the data (but not 
in the occurrences of the patterns). We empirically evaluate Vouw on synthetic 
data and demonstrate that it is able to accurately recover planted patterns. 


2 Related Work 


As the first pattern mining approach using the MDL principle, Krimp [12] was 
one of the main sources of inspiration for this paper. Many papers on 
pattern-based modelling using MDL have appeared since, both improving the 
search, e.g., Slim [10], and extending the approach to other problems, e.g., 
Classy [7] for rule-based classification. 

The problem closest to ours is probably that of geometric tiling, as introduced 
by Gionis et al. [4] and later also combined with the MDL principle by Tatti 
and Vreeken [11]. Geometric tiling, however, is limited to Boolean data and 
rectangularly shaped patterns (tiles); we strongly relax both these limitations 
(but as of yet do not support patterns based on densities or noisy occurrences). 

Campana et al. [2] also use matrix-like data (textures) in a compression- 
based similarity measure. Their method, however, has less value for explanatory 
analysis as it relies on generic compression algorithms that are essentially a black 
box. 

Geometric pattern mining is different from graph mining, although the con- 
cept of a matrix can be redefined as a grid-like graph where each node has a 
fixed degree. This is the approach taken by Deville et al. [3], solving a problem 
similar to ours but using an approach akin to bag-of-words instead of the MDL 
principle. 


3 Geometric Pattern Mining Using MDL 


We define geometric pattern mining on bounded, discrete and two-dimensional 
raster-based data. We represent this data as an $M \times N$ matrix $A$ whose rows 
and columns are finite and in a fixed ordering (i.e., reordering rows and columns 
semantically alters the matrix). Each element $a_{i,j} \in S$, where row $i \in [0; N)$, 
column $j \in [0; M)$, and $S$ is a finite set of symbols, i.e., the alphabet of $A$. 
According to the MDL principle, the shortest (optimal) description of A 
reveals all structure of A in the most succinct way possible. This optimal descrip- 
tion is only optimal if we can unambiguously reconstruct A from it and nothing 
more—the compression is both minimal and lossless. Figure 2 illustrates how an 
example matrix could be succinctly described using patterns: matrix $A$ is 
decomposed into patterns $X$ and $Y$. A set of such patterns constitutes the model 
for a matrix $A$, denoted $H_A$ (or $H$ for short when $A$ is clear from the context). In 
order to reconstruct $A$ from this model, we also need a mapping from $H_A$ 
back to $A$. This mapping represents what (two-part) MDL calls the data given 
the model $H_A$. In this context we can think of this as the set of all instructions 
required to rebuild $A$ from $H_A$, which we call the instantiation of $H_A$ and 
denote by $I$ in the example. These concepts allow us to express matrix $A$ as 
a decomposition into sets of local and global spatial information, which we will 
next describe in more detail. 




Fig. 2. Example decomposition of $A$ into instantiation $I$ and patterns $X$, $Y$. 


3.1 Patterns and Instances 


> We define a pattern as an $M_X \times N_X$ submatrix $X$ of the original matrix 
$A$. Elements of this submatrix may be -, the empty element, which gives us the 
ability to cut out any irregular-shaped part of $A$. We additionally require the 
elements of $X$ to be adjacent (horizontally, vertically or diagonally) to at least one 
non-empty element and that no rows and columns are empty. 

From this definition, the dimensions $M_X \times N_X$ give the smallest rectangle 
around $X$ (the bounding box). We also define the cardinality $|X|$ of $X$ as the 
number of non-empty elements. We call a pattern $X$ with $|X| = 1$ a singleton 
pattern, i.e., a pattern containing exactly one element of $A$. 

Each pattern contains a special pivot element: pivot($X$) is the first non-empty 
element of $X$. A pivot can be thought of as a fixed point in $X$ which 
we can use to position its elements in relation to $A$. This translation, or offset, 
is a tuple $q = (i,j)$ that is on the same domain as an index in $A$. We realise 
this translation by placing all elements of $X$ in an empty $M \times N$ size matrix 
such that the pivot element is at $(i, j)$. We formalise this in the instantiation 
operator $\otimes$: 


> We define the instance $X \otimes (i,j)$ as the $M \times N$ matrix containing all elements 
of $X$ such that pivot($X$) is at index $(i, j)$ and the distances between all elements 
are preserved. The resulting matrix contains no additional non-empty elements. 

Since this does not yield valid results for arbitrary offsets $(i, j)$, we enforce two 
constraints: (1) an instance must be well-defined: placing pivot($X$) at index 
$(i, j)$ must result in an instance that contains all elements of $X$, and (2) elements 
of instances cannot overlap, i.e., each element of $A$ can be described only once. 


> Two pattern instances $X \otimes q$ and $Y \otimes r$, with $q \neq r$, are non-overlapping if 
$|(X \otimes q) + (Y \otimes r)| = |X| + |Y|$. 

From here on we will use the same letter in lower case to denote an arbitrary 
instance of a pattern, e.g., $x = X \otimes q$ when the exact value of $q$ is unimportant. 
Since instances are simply patterns projected onto an $M \times N$ matrix, we can 
reverse $\otimes$ by removing all completely empty rows and columns: 


> Let $X \otimes q$ be an instance of $X$, then by definition we say that $\oslash(X \otimes q) = X$. 
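
To make these definitions concrete, the following Python sketch models a pattern as a mapping from (row, column) coordinates to symbols, with absent coordinates playing the role of the empty element. The dictionary representation and the function names `pivot`, `instantiate`, `strip_instance` and `non_overlapping` are assumptions made for illustration, not part of the paper's implementation.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]     # (row, column)
Pattern = Dict[Cell, int]  # non-empty elements only; missing cells are empty


def pivot(pattern: Pattern) -> Cell:
    """First non-empty element in lexicographical (row-major) order."""
    return min(pattern)


def instantiate(pattern: Pattern, offset: Cell, n_rows: int, n_cols: int) -> Pattern:
    """Place the pattern so that its pivot lands on `offset`."""
    pivot_row, pivot_col = pivot(pattern)
    i, j = offset
    instance = {(r - pivot_row + i, c - pivot_col + j): v
                for (r, c), v in pattern.items()}
    # Well-definedness: every element must fall inside the matrix.
    if any(not (0 <= r < n_rows and 0 <= c < n_cols) for r, c in instance):
        raise ValueError("instance is not well-defined at this offset")
    return instance


def strip_instance(instance: Pattern) -> Pattern:
    """Drop the empty rows and columns around an instance."""
    top = min(r for r, _ in instance)
    left = min(c for _, c in instance)
    return {(r - top, c - left): v for (r, c), v in instance.items()}


def non_overlapping(x: Pattern, y: Pattern) -> bool:
    """True if the summed instance has |X| + |Y| non-empty elements."""
    return len(x.keys() | y.keys()) == len(x) + len(y)


# An L-shaped pattern with three non-empty elements; its pivot is (0, 1).
shape = {(0, 1): 7, (1, 0): 3, (1, 1): 3}
placed = instantiate(shape, (2, 3), n_rows=4, n_cols=5)
```

In this sketch, `strip_instance(placed)` returns the original `shape`, mirroring the statement that the inverse operator recovers the pattern from any of its instances.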
We briefly introduced the instantiation I as a set of ‘instructions’ of where 
instances of each pattern should be positioned in order to obtain A. As Fig. 2 
suggests, this mapping has the shape of an M x N matrix. 


> Given a set of patterns $H$, the instantiation (matrix) $I$ is an $M \times N$ matrix 
such that $I_{i,j} \in H \cup \{-\}$ for all $(i, j)$, where - denotes the empty element. For all 
non-empty $I_{i,j}$ it holds that $I_{i,j} \otimes (i,j)$ is a non-overlapping instance of $I_{i,j}$ in $A$. 


3.2 The Problem and Its Solution Space 


Larger patterns can be naturally constructed by joining (or merging) smaller 
patterns in a bottom-up fashion. To limit the considered patterns to those 
relevant to $A$, instances can be used as an intermediate step. As Fig. 3 demonstrates, 
we can use a simple element-wise matrix addition to sum two instances and use 
$\oslash$ to obtain a joined pattern. Here we start by instantiating $X$ and $Y$ with offsets 
$(1,0)$ and $(1,1)$, respectively. We add the resulting $x$ and $y$ to obtain $\oslash(x + y)$, the 
union of $X$ and $Y$ with relative offset $(1,1) - (1,0) = (0,1)$. 


[Figure 3 panels: $x = X \otimes (1,0)$, $y = Y \otimes (1,1)$, $x + y$, and $Z = \oslash(x + y)$.] 


Fig. 3. Example of joining patterns X and Y to construct a new pattern Z. 
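
The joining step can be sketched in Python as follows. Instances are represented as dictionaries from (row, column) to symbol, which is an assumption made for this illustration, and `join` is a hypothetical helper name. The example uses two singleton patterns with the offsets from the text; the symbol values are made up, since the contents of the figure are not reproduced here.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]      # (row, column)
Instance = Dict[Cell, int]  # non-empty elements of an instance


def join(x: Instance, y: Instance) -> Instance:
    """Sum two non-overlapping instances and strip the empty border."""
    if x.keys() & y.keys():
        raise ValueError("instances overlap")
    summed = {**x, **y}  # element-wise addition of disjoint instances
    top = min(r for r, _ in summed)
    left = min(c for _, c in summed)
    return {(r - top, c - left): v for (r, c), v in summed.items()}


# Two singleton patterns instantiated at offsets (1, 0) and (1, 1).
x = {(1, 0): 1}
y = {(1, 1): 0}
z = join(x, y)
print(z)  # {(0, 0): 1, (0, 1): 0}
```

The joined pattern places the second element at relative offset (0, 1) from the first, matching the offset difference computed in the text.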


The Sets $\mathcal{H}_A$ and $\mathcal{I}_A$. We define the model class $\mathcal{H}$ as the set of all 
possible models for all possible inputs. Without any prior knowledge, this would be 
the search space. To simplify the search, however, we only consider the more 
bounded subset $\mathcal{H}_A$ of all possible models for $A$, and $\mathcal{I}_A$, the set of all possible 
instantiations for these models. To this end we first define $H^0_A$ to be the model 
with only singleton patterns, i.e., $H^0_A = S$, and denote its corresponding 
instantiation matrix by $I^0_A$. Given that each element of $I^0_A$ must correspond to exactly 
one element of $A$ in $H^0_A$, we see that each $I_{i,j} = a_{i,j}$ and so we have $I^0_A = A$. 
Using $H^0_A$ and $I^0_A$ as base cases we can now inductively define $\mathcal{I}_A$: 


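Base case: $I^0_A \in \mathcal{I}_A$. 
By induction: If $I$ is in $\mathcal{I}_A$ then take any pair $I_{i,j}, I_{k,l} \in I$ such that $(i, j) < (k,l)$ 
in lexicographical order. Then the set $I'$ is also in $\mathcal{I}_A$, providing $I'$ equals $I$ 
except: 

$$I'_{i,j} = \oslash\big(I_{i,j} \otimes (i,j) + I_{k,l} \otimes (k,l)\big), \qquad I'_{k,l} = -$$ 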


This shows we can add any two instances together, in any order, as they are by 
definition always non-overlapping and thus valid in $A$, and thereby obtain another 
element of $\mathcal{I}_A$. Eventually this results in just one big instance that is equal to 
$A$. Note that when we take two elements $I_{i,j}, I_{k,l} \in I$ we force $(i, j) < (k,l)$, not 
only to eliminate different routes to the same instance matrix, but also so that 
the pivot of the new pattern coincides with $I_{i,j}$. We can then leave $I_{k,l}$ empty. 
The construction of $\mathcal{I}_A$ also implicitly defines $\mathcal{H}_A$. While this may seem 
odd (defining models for instantiations instead of the other way around), note 
that there is no unambiguous way to find one instantiation for a given model. 
Instead we find the following definition by applying the inductive construction: 


$$\mathcal{H}_A = \big\{ \{\oslash(x) \mid x \in I\} \;\big|\; I \in \mathcal{I}_A \big\}. \qquad (1)$$ 


So for any instantiation $I \in \mathcal{I}_A$ there is a corresponding set in $\mathcal{H}_A$ of all patterns 
that occur in $I$. This results in an interesting symbiosis between model and 
instantiation: increasing the complexity of one decreases that of the other. This 
construction gives a tightly connected lattice as shown in Fig. 4. 


3.3 Encoding Models and Instances 


From all models in $\mathcal{H}_A$ we want to select the model that describes $A$ best. 
Two-part MDL [5] tells us to choose the model that minimises the sum 
$L_1(H_A) + L_2(A \mid H_A)$, where $L_1$ and $L_2$ are two functions that give the length 
of the model and the length of ‘the data given the model’, respectively. In this 
context, the data given the model is given by $I_A$, which represents the accidental 
information needed to reconstruct the data $A$ from $H_A$. 


Fig. 4. Model space lattice for a $2 \times 2$ Boolean matrix. The $V$, $W$, and $Z$ columns show 
which pattern is added in each step, while $I$ depicts the current instantiation. 


In order to compute their lengths, we need to decide how to encode $H_A$ and 
$I$. As this encoding is of great influence on the outcome, we should adhere to 
the conditions that follow from MDL theory: (1) the model and data must be 
encoded losslessly; and (2) the encoding should be as concise as possible, i.e., it 
should be optimal. Note that for the purpose of model selection we only need 
the length functions; we do not need to actually encode the patterns or data. 


Code Length Functions. Although the patterns in $H$ and instantiation matrix 
$I$ are all matrices, they have different characteristics and thus require different 
encodings. For example, the size of $I$ is constant and can be ignored, while the 
sizes of the patterns vary and should be encoded. Hence we construct different 
length functions¹ for the different components of $H$ and $I$, as listed in Table 1. 


Table 1. Code length definitions. Each row specifies the code length given by the first 
column as the sum of the remaining terms. 


Length   | Matrix        | Bounds     | # Elements                               | Positions                 | Symbols 
$L_p(X)$ | Pattern       | $\log(MN)$ | $L_N\binom{M_X N_X}{\lvert X \rvert}$    | (encoded with # Elements) | $\lvert X \rvert \log(\lvert S \rvert)$ 
$L_1(H)$ | Model         | N/A        | $L_N(\lvert H \rvert)$                   | N/A                       | $\sum_{X \in H} L_p(X)$ 
$L_2(I)$ | Instantiation | Constant   | $\log(MN)$                               | Implicit                  | $L_{pp}(I)$ 

When encoding $I$, we observe that it contains each pattern $X \in H$ multiple 
times, given by the usage $U(X)$ of $X$. Using the prequential plug-in code [5] to 
encode $I$ enables us to omit encoding these usages separately, which would 
create unwanted bias. The prequential plug-in code gives us the following length 
function for $I$. We use $\epsilon = 0.5$ and elaborate on its derivation in the Appendix². 


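$$L_{pp}(I) = -\sum_{X_i \in H} \log \frac{\Gamma(U(X_i) + \epsilon)}{\Gamma(\epsilon)} + \log \frac{\Gamma(|I| + \epsilon |H|)}{\Gamma(\epsilon |H|)} \qquad (2)$$ 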


Each length function has four terms. First we encode the total size of the 
matrix. Since we assume $MN$ to be known/constant, we can use this constant to 
define the uniform distribution $\frac{1}{MN}$, so that $\log MN$ encodes an arbitrary index 
of $A$. Next we encode the number of elements that are non-empty. For patterns 
this value is encoded together with the third term, namely the positions of the 
non-empty elements. We use the previously encoded $M_X N_X$ in the binomial 
coefficient to enumerate the ways we can place the $|X|$ elements onto a grid of 
$M_X N_X$. This gives us both how many non-empties there are as well as where 
they are. Finally the fourth term is the length of the actual symbols that encode 
the elements of the matrix. In case we encode single elements of $A$, we assume 
that each unique value in $A$ occurs with equal probability; without other prior 
knowledge, using the uniform distribution has minimax regret and is therefore 
optimal. For the instance matrix, which encodes symbols to patterns, the 
prequential code is used as demonstrated before. Note that $L_N$ is the universal prior 
for the integers [9], which can be used for arbitrary integers and penalises larger 
integers. 
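
The length functions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the universal integer code is implemented in the usual Rissanen form $L_N(n) = \log^*(n) + \log c_0$ with $c_0 \approx 2.865064$, the prequential plug-in length is written in its closed form with gamma functions, and all function and parameter names are hypothetical rather than taken from the authors' implementation.

```python
import math

EPSILON = 0.5  # pseudo-count of the prequential plug-in code, as in the text


def universal_integer_length(n: int) -> float:
    """L_N(n) in bits for an integer n >= 1 (assumed Rissanen form)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    length = math.log2(2.865064)
    term = math.log2(n)
    while term > 0:  # sum only the positive terms of log*(n)
        length += term
        term = math.log2(term)
    return length


def pattern_length(m_x: int, n_x: int, cardinality: int,
                   m: int, n: int, alphabet_size: int) -> float:
    """L_p(X): bounds, joint count and positions, and symbols (Table 1)."""
    bounds = math.log2(m * n)
    positions = universal_integer_length(math.comb(m_x * n_x, cardinality))
    symbols = cardinality * math.log2(alphabet_size)
    return bounds + positions + symbols


def log_g(a: float, b: float) -> float:
    """log2 of Gamma(a + b*eps) / Gamma(b*eps)."""
    return (math.lgamma(a + b * EPSILON) - math.lgamma(b * EPSILON)) / math.log(2)


def prequential_length(usages: list) -> float:
    """Closed-form prequential plug-in length of an instantiation, in bits."""
    total = sum(usages)
    return log_g(total, len(usages)) - sum(log_g(u, 1) for u in usages)


# Example: a 2 x 3 pattern with 4 non-empty elements in a 32 x 24 byte matrix.
print(round(pattern_length(2, 3, 4, 32, 24, 256), 2))
print(round(prequential_length([3, 1, 1]), 2))
```

The closed form avoids iterating over the instantiation: it depends only on the usage of each pattern, the number of instances and the number of patterns.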


4 The Vouw Algorithm 


Pattern mining often yields vast search spaces and geometric pattern mining is 
no exception. We therefore use a heuristic approach, as is common in MDL-based 
approaches [7,10,12]. We devise a greedy algorithm that exploits the inductive 
definition of the search space as shown by the lattice in Fig. 4. We start with a 
completely underfit model (leftmost in the lattice), where there is one instance for 
each matrix element. Next, in each iteration we combine two patterns, resulting 
in one or more pairs of instances to be merged (i.e., we move one step right in the 
lattice). In each step we merge the pair of patterns that improves compression 
most, and we repeat this until no improvement is possible. 


¹ We calculate code lengths in bits and therefore all logarithms have base 2. 
² The appendix is available on https://arxiv.org/abs/1911.09587. 


4.1 Finding Candidates 


The first step is to find the ‘best’ candidate pair of patterns for merging 
(Algorithm 1). A candidate is denoted as a tuple $(X, Y, \delta)$, where $X$ and $Y$ are 
patterns and $\delta$ is the relative offset of $X$ and $Y$ as they occur in the data. Since we 
only need to consider pairs of patterns and offsets that actually occur in the instance 
matrix, we can directly enumerate candidates from the instantiation matrix and 
never even need to consider the original data. 


Algorithm 1 FindCandidates 
Input: I 
Output: C 
for all x ∈ I do 
    for all y ∈ POST(x) do 
        X ← ⊘(x), Y ← ⊘(y) 
        δ ← dist(X, Y) 
        if X = Y then 
            if V(x)[e] = 1 continue 
            V(y)[e] ← 1 
        end if 
        C ← C ∪ (X, Y, δ) 
        sup(X, Y, δ) += 1 
    end for 
end for 


Algorithm 2 Vouw 
Input: H, I 
C ← FindCandidates(I) 
(X, Y, δ) ∈ C : ∀c ∈ C, ΔL((X, Y, δ)) ≥ ΔL(c) 
ΔL_best ← ΔL((X, Y, δ)) 
if ΔL_best > 0 then 
    Z ← ⊘(X ⊗ (0,0) + (Y ⊗ δ)) 
    H ← H ∪ {Z} 
    for all x_i ∈ I | ⊘(x_i) = X do 
        for all y ∈ POST(x_i) | ⊘(y) = Y do 
            x_i ← Z, y ← - 
        end for 
    end for 
end if 
repeat until ΔL_best ≤ 0 


The support of a candidate, written sup$(X, Y, \delta)$, tells how often it is found 
in the instance matrix. Computing support is not completely trivial, as one 
candidate occurs multiple times in ‘mirrored’ configurations, such as $(X, Y, \delta)$ and 
$(Y, X, -\delta)$, which are equivalent but can still be found separately. Furthermore, 
due to the definition of a pattern, many potential candidates cannot be 
considered by the simple fact that their elements are not adjacent. 


Peripheries. For each instance $x$ we define its periphery: the set of instances 
which are positioned such that their union with $x$ produces a valid pattern. This 
set is split into anterior ANT($x$) and posterior POST($x$) peripheries, containing 
instances that come before and after $x$ in lexicographical order, respectively. 
This enables us to scan the instance matrix once, in lexicographical order. For 
each instance $x$, we only consider the instances POST($x$) as candidates, thereby 
eliminating any (mirrored) duplicates. 


Self-overlap. Self-overlap happens for candidates of the form $(X, X, \delta)$. In this 
case, too many or too few copies may be counted. Take for example a straight 
line of five instances of $X$. There are four unique pairs of two $X$’s, but only two 
can be merged at a time, in three different ways. Therefore, when considering 
candidates of the form $(X, X, \delta)$, we also compute an overlap coefficient. This 
coefficient $e$ is given by $e = (2N_X + 1)\delta_i + \delta_j + N_X$, which essentially transforms 
$\delta$ into a one-dimensional coordinate space of all possible ways that $X$ could be 
arranged after and adjacent to itself. For each instance $x$, a vector of bits $V(x)$ 
is used to remember if we have already encountered a combination $x_1, x_2$ with 
coefficient $e$, such that we do not count a combination $x_2, x_3$ with an equal $e$. 
This eliminates the problem of incorrect counting due to self-overlap. 
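
As an illustration of candidate search, the following Python sketch applies the logic of Algorithm 1 to the initial instantiation, in which every instance is a singleton. Under that simplifying assumption the posterior periphery of an element consists of its right neighbour and its three neighbours on the next row. The sketch keys the self-overlap markers directly on the offset δ instead of the coefficient e, which serves the same purpose here; all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Offsets (row, column) of neighbours that come later in row-major order.
POSTERIOR_OFFSETS = [(0, 1), (1, -1), (1, 0), (1, 1)]

Candidate = Tuple[int, int, Tuple[int, int]]  # (X, Y, delta)


def find_candidates(matrix: List[List[int]]) -> Dict[Candidate, int]:
    """Count the support of all candidates among singleton instances."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    support: Dict[Candidate, int] = defaultdict(int)
    # V(x): offsets at which x was already the second half of a counted pair.
    seen = defaultdict(set)

    for i in range(n_rows):
        for j in range(n_cols):
            for delta in POSTERIOR_OFFSETS:
                k, l = i + delta[0], j + delta[1]
                if not (0 <= k < n_rows and 0 <= l < n_cols):
                    continue
                x, y = matrix[i][j], matrix[k][l]
                if x == y:
                    # Self-overlap: skip if x was already paired at this offset.
                    if delta in seen[(i, j)]:
                        continue
                    seen[(k, l)].add(delta)
                support[(x, y, delta)] += 1
    return dict(support)


# Five equal values in a row give four adjacent pairs, but only two merges.
print(find_candidates([[7, 7, 7, 7, 7]]))  # {(7, 7, (0, 1)): 2}
```

Because only posterior neighbours are visited, a mirrored candidate such as (Y, X, −δ) is never produced for a pair that was already counted as (X, Y, δ).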


4.2 Gain Computation 


After candidate search we have a set of candidates $C$ and their respective 
supports. The next step is to select the candidate that gives the best gain: the 
improvement in compression by merging the candidate pair of patterns. For 
each candidate $c = (X, Y, \delta)$ the gain $\Delta L(A', c)$ consists of two parts: (1) 
the negative gain of adding the union pattern $Z$ to the model $H$, resulting in 
$H'$, and (2) the gain of replacing all instances $x, y$ with relative offset $\delta$ by $Z$ in 
$I$, resulting in $I'$. We use length functions $L_1, L_2$ to derive an equation for gain: 


$$\begin{aligned}
\Delta L(A', c) &= \big(L_1(H) + L_2(I)\big) - \big(L_1(H') + L_2(I')\big) \\
&= L_N(|H|) - L_N(|H| + 1) - L_p(Z) + \big(L_2(I) - L_2(I')\big) \qquad (3)
\end{aligned}$$ 


As we can see, the terms with $L_1$ are simplified to $-L_p(Z)$ and the model’s 
length because $L_1$ is simply a summation of individual pattern lengths. The 
equation of $L_2$ requires the recomputation of the entire instance matrix’s length, 
which is expensive considering we need to perform it for every candidate, every 
iteration. However, we can rework the function $L_{pp}$ in Eq. (2) by observing that 
we can isolate the logarithms and generalise them into 


$$\log G(a, b) = \log \frac{\Gamma(a + b\epsilon)}{\Gamma(b\epsilon)} = \log \Gamma(a + b\epsilon) - \log \Gamma(b\epsilon), \qquad (4)$$ 


which can be used to rework the second part of Eq. (3) in such a way that the gain 
equation can be computed in constant time. 


$$\begin{aligned}
L_2(I) - L_2(I') ={} & \log G(|I|, |H|) - \log G(|I'|, |H'|) \\
& - \log G(U(X), 1) - \log G(U(Y), 1) \\
& + \log G(U(X) - U(Z), 1) + \log G(U(Y) - U(Z), 1) + \log G(U(Z), 1) \qquad (5)
\end{aligned}$$ 


Notice that in some cases the usages of X and Y are equal to that of Z, which 
means additional gain is created by removing X and Y from the model. 
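
The constant-time update can be checked numerically. The Python sketch below is an illustration under assumptions: usages are plain integers, $X \neq Y$, the number of instances and the number of patterns are assumed to be cached, and the helper names are hypothetical. It computes the change in the prequential length of the instantiation from only the affected usages, and the example compares it with a full recomputation.

```python
import math

EPSILON = 0.5


def log_g(a: float, b: float) -> float:
    """log2 of Gamma(a + b*eps) / Gamma(b*eps)."""
    return (math.lgamma(a + b * EPSILON) - math.lgamma(b * EPSILON)) / math.log(2)


def prequential_length(usages: dict) -> float:
    """Full recomputation of the prequential plug-in length, in bits."""
    return (log_g(sum(usages.values()), len(usages))
            - sum(log_g(u, 1) for u in usages.values()))


def instantiation_gain(u_x: int, u_y: int, u_z: int, total: int, size: int) -> float:
    """Old length minus new length when u_z pairs (x, y) become instances of Z.

    `total` is the number of instances and `size` the number of patterns
    before the merge; both are assumed to be maintained incrementally.
    """
    # Patterns whose usage drops to zero leave the model; the union joins it.
    new_size = size + 1 - (u_x == u_z) - (u_y == u_z)
    return (log_g(total, size) - log_g(total - u_z, new_size)
            - log_g(u_x, 1) - log_g(u_y, 1)
            + log_g(u_x - u_z, 1) + log_g(u_y - u_z, 1) + log_g(u_z, 1))


before = {"X": 5, "Y": 4, "W": 3}
after = {"X": 3, "Y": 2, "W": 3, "Z": 2}
fast = instantiation_gain(5, 4, 2, total=12, size=3)
slow = prequential_length(before) - prequential_length(after)
print(abs(fast - slow) < 1e-9)  # True
```

The function touches a fixed number of gamma terms regardless of the size of the instantiation, which is what makes evaluating every candidate in every iteration affordable.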
4.3 Mining a Set of Patterns 


In the second part of the algorithm, listed in Algorithm 2, we select the 
candidate $(X, Y, \delta)$ with the largest gain and merge $X$ and $Y$ to form $Z$, as explained 
in Sect. 3.2. We linearly traverse $I$ to replace all instances $x$ and $y$ with relative 
offset $\delta$ by instances of $Z$. $(X, Y, \delta)$ was constructed by looking in the posterior 
periphery of all $x$ to find $Y$ and $\delta$, which means that $Y$ always comes after $X$ in 
lexicographical order. The pivot of a pattern is the first element in lexicographical 
order, therefore pivot($Z$) = pivot($X$). This means that we can replace all 
matching $x$ with an instance of $Z$ and all matching $y$ with -. 


4.4 Improvements 


Local Search. To improve the efficiency of finding large patterns without 
sacrificing the underlying idea of the original heuristics, we add an optional local 
search. Observe that without local search, Vouw generates a large pattern $X$ 
by adding small elements to an incrementally growing pattern, resulting in a 
behaviour that requires up to $|X| - 1$ steps. To speed this up, we can try to 
‘predict’ which elements will be added to $X$ and add them immediately. After 
selecting candidate $(X, Y, \delta)$ and merging $X$ and $Y$ into $Z$, for all $m$ resulting 
instances $z_i \in \{z_0, \ldots, z_{m-1}\}$ we try to find a pattern $W$ and offset $\delta$ such that 
an instance of $W$ occurs at relative offset $\delta$ in the periphery of every $z_i$. 

This yields zero or more candidates $(Z, W, \delta)$, which are then treated as any 
set of candidates: candidates with the highest gain are iteratively merged until 
no candidates with positive gain exist. This essentially means that we run the 
baseline algorithm only on the peripheries of all $z_i$, with the condition that the 
support of the candidates is equal to that of $Z$. 


Fig. 5. Synthetic patterns are added to a matrix filled with noise. The difference 
between the ground truth and the matrix reconstructed by the algorithm is used to 
compute precision and recall. Panels: (a) Generated matrix, (b) Ground truth, 
(c) Found patterns, (d) Difference. 


Fig. 6. The influence of SNR in the ground truth (left) and prevalence on recall (right) 


Reusing Candidates. We can improve performance by reusing the candidate 
set and slightly changing the search heuristic of the algorithm. The Best-* 
heuristic selects multiple candidates on each iteration, as opposed to the baseline 
Best-1 heuristic that only selects a single candidate with the highest gain. Best-* 
selects candidates in descending order of gain until no candidates with positive 
gain are left. Furthermore we only consider candidates that are all disjoint, 
because when we merge candidate $(X, Y, \delta)$, remaining candidates with $X$ and/or 
$Y$ have unknown support and therefore unknown gain. 
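
The Best-* selection step might be sketched as follows in Python. The sketch assumes that each candidate is given as a tuple of two pattern identifiers, an offset and a precomputed gain, and it interprets ‘disjoint’ as sharing no pattern with an already selected candidate; both the data layout and the function name are assumptions.

```python
from typing import Hashable, List, Tuple

Candidate = Tuple[Hashable, Hashable, Tuple[int, int], float]  # (X, Y, delta, gain)


def select_best_star(candidates: List[Candidate]) -> List[Candidate]:
    """Pick disjoint candidates with positive gain, in descending order of gain."""
    selected: List[Candidate] = []
    used = set()  # patterns that already take part in a selected candidate
    for x, y, delta, gain in sorted(candidates, key=lambda c: c[3], reverse=True):
        if gain <= 0:
            break  # all remaining candidates have non-positive gain
        if x in used or y in used:
            continue  # support, and hence gain, would be stale after merging
        selected.append((x, y, delta, gain))
        used.update((x, y))
    return selected


candidates = [
    ("A", "B", (0, 1), 12.5),
    ("B", "C", (1, 0), 9.0),   # shares B with the best candidate
    ("C", "D", (0, 1), 4.0),
    ("E", "F", (1, 1), -1.0),  # negative gain
]
print([c[:2] for c in select_best_star(candidates)])  # [('A', 'B'), ('C', 'D')]
```

Skipped candidates are not lost: they can be re-evaluated in the next iteration, once their supports have been recomputed.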


5 Experiments 


To assess Vouw’s practical performance we primarily use Ril, a synthetic dataset 
generator developed for this purpose. Ril utilises random walks to populate a 
matrix with patterns of a given size and prevalence, up to a specified density, 
while filling the remainder of the matrix with noise. Both the pattern elements 
and the noise are picked from the same uniform random distribution on the 
interval [0,255]. The signal-to-noise ratio (SNR) of the data is defined as the 
number of pattern elements over the matrix size MN. The objective of the 
experiment is to assess whether Vouw recovers all of the signal (the patterns) 
and none of the noise. Figure 5 gives an example of the generated data and how 
it is evaluated. A more extensive description can be found in the Appendix (see 
footnote 2). 


Implementation. The implementation³ used consists of the Vouw algorithm 
(written in vanilla C/C++), a GUI, and the synthetic benchmark Ril. 
Experiments were performed on an Intel Xeon-E2630v3 with 512 GB RAM. 


Evaluation. Completely random data (noise) is unlikely to be compressed. The 
SNR tells us how much of the data is noise and thus conveniently gives us an 
upper bound of how much compression could be achieved. We use the ground 
truth SNR versus the resulting compression ratio as a benchmark to tell us how 
close we are to finding all the structure in the ground truth. 


³ https://github.com/mickymuis/libvouw. 
In addition, we also compare the ground truth matrix to the obtained model 
and instantiation. As singleton patterns do not yield any compression over the 
baseline model, we reconstruct the matrix omitting any singleton patterns. Ignor- 
ing the actual values, this gives us a Boolean matrix with ‘positives’ (pattern 
occurrence = signal) and ‘negatives’ (no pattern = noise). By comparing each ele- 
ment in this matrix with the corresponding element in the ground truth matrix, 
precision and recall can be calculated and evaluated. 
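
The element-wise evaluation can be sketched in Python as follows. The function name and the use of nested lists of Booleans are assumptions made for illustration; `True` marks an element that belongs to a pattern. Returning 1.0 when a denominator is zero is a convention chosen for this sketch.

```python
from typing import List, Tuple


def precision_recall(found: List[List[bool]],
                     truth: List[List[bool]]) -> Tuple[float, float]:
    """Element-wise precision and recall of found signal against ground truth."""
    pairs = [(f, t) for f_row, t_row in zip(found, truth)
             for f, t in zip(f_row, t_row)]
    true_pos = sum(1 for f, t in pairs if f and t)
    false_pos = sum(1 for f, t in pairs if f and not t)
    false_neg = sum(1 for f, t in pairs if t and not f)
    # Convention (assumption): an empty denominator counts as a perfect score.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return precision, recall


truth = [[True, True, False],
         [False, True, False]]
found = [[True, False, False],
         [False, True, True]]
precision, recall = precision_recall(found, truth)
```

In this small example two of the three reported elements are correct and two of the three true elements are found, so both precision and recall equal 2/3.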

Figure 6 (left) shows the influence of ground truth SNR on compression ratio 
for different matrix sizes. Compression ratio and SNR are clearly strongly cor- 
related. Figure 6 (right) shows that patterns with a low prevalence (i.e., number 
of planted occurrences) have a lower probability of being ‘detected’ by the algo- 
rithm as they are more likely to be accidental/noise. Increasing the matrix size 
also increases this threshold. In Table 2 we look at the influence of the two 
improvements upon the baseline algorithm as described in Sect. 4.4. In terms 
of quality, local search can improve the results quite substantially while Best-* 
notably lowers precision. Both improve speed by an order of magnitude. 


Table 2. Performance measurements for the baseline algorithm and its optimisations. 


     |     | Precision/Recall                      | Average time 
Size | SNR | None    | Local   | Best-*  | Both    | None    | Local  | Best-* | Both 
256  | .05 | .98/.98 | .99/.99 | .93/.98 | .95/.99 | 29s     | 1s     | 2s     | 1s 
     | .3  | .99/.8  | .99/.88 | .96/.82 | .99/.89 | 2m 32s  | 9s     | 5s     | 5s 
512  | .05 | .98/.97 | .99/.99 | .87/.97 | .93/.98 | 5m 26s  | 8s     | 20s    | 6s 
     | .3  | .97/.93 | .99/.99 | .94/.91 | .97/.90 | 26m 52s | 2m 32s | 24s    | 65s 
1024 | .05 | .97/.98 | .99/.99 | .84/.98 | .92/.96 | 21m 34s | 44s    | 37s    | 34s 
     | .3  | .98/.98 | .99/.99 | .93/.96 | .98/.97 | 116m 4s | 7m 31s | 1m 49s | 3m 31s 


6 Conclusions 


We introduced geometric pattern mining, the problem of finding recurring struc- 
tures in discrete, geometric matrices, or raster-based data. Further, we presented 
Vouw, a heuristic algorithm for finding sets of geometric patterns that are good 
descriptions according to the MDL principle. It is capable of accurately recover- 
ing patterns from synthetic data, and the resulting compression ratios are on par 
with the expectations based on the density of the data. For the future, we think 
that extensions to fault-tolerant patterns and clustering have large potential. 
Abstract. Nonnegative Matrix Factorization (NMF), originally designed 
for dimensionality reduction, has received a tremendous amount of 
attention over the years for clustering purposes in fields such as image 
processing and text mining. However, despite its mathematical elegance 
and simplicity, NMF has one main issue: its strong sensitivity to starting 
points, which makes it struggle to converge toward an optimal solution. 
Moreover, we found that even with a meaningful initialization, the 
solution reaching the best local minimum is not always the one with the 
best clustering quality; a better clustering can sometimes be obtained 
from a solution that is slightly worse in terms of the criterion. Therefore, 
in this paper we study the clustering characteristics and quality of a set 
of the best NMF solutions and provide a method that delivers a better 
partition using a consensus of these solutions. 


Keywords: NMF - Clustering - Clustering ensemble - Consensus 


1 Introduction 


When dealing with text data, document clustering techniques make it possible to 
divide a set of documents into groups so that documents assigned to the same group 
are more similar to each other than to documents assigned to other groups 
[12,18,21,22]. In information retrieval, the use of clustering relies on the 
assumption that if a document is relevant to a query, then other documents in the same 
cluster can also be relevant. This hypothesis can be used at different stages 
in the information retrieval process, the two most notable being: cluster-based 
retrieval to speed up search, and search result clustering to help users navigate 
and understand what is in the search results. Document clustering, which 
remains a hot topic, can be tackled with different approaches. In our 
contribution we rely on nonnegative matrix factorization for its simplicity and 
popularity. We do not propose a new variant of NMF but rather a consensus 
approach that boosts its performance. 

Unlike supervised learning, the evaluation of clustering algorithms (unsupervised 
learning) remains a difficult problem. When relying on generative models, 
it is easier to evaluate the performance of a given clustering algorithm based 
on the simulated partition. On real data that is already labeled, many papers evaluate 
the performance of clustering algorithms by relying on indices such as Accuracy 
(ACC), Normalized Mutual Information (NMI) [25] and Adjusted Rand Index 
(ARI) [14]. However, the commonly used algorithms, such as k-means, 
EM [8], Classification EM [6] and NMF [15], are iterative and require several 
initializations; the resulting partition is the one optimizing the objective function. 
Some of these works compare methods on the basis of the maximum 
ACC/NMI/ARI obtained over several initializations, rather than on the run 
that optimizes the criterion used in the algorithm. Such a comparison 
is not accurate, because these measures cannot be calculated in 
practice and therefore cannot be used in this way to evaluate the quality of a 
clustering algorithm. 

A fair comparison can only be made on the basis of the objective functions
considered for clustering, for example within-cluster inertia, likelihood,
classification likelihood for mixture models, factorization error, etc.
Nonetheless, in our experiments, we observed that while the clustering results
become better in terms of ACC/NMI/ARI when the objective function value
improves, the best value is not necessarily associated with the best results.
However, when ranking the objective values, the best partition tends to be
among those with the best scores. We illustrate this behavior in Fig. 4. This
remark leads us to consider an ensemble method, an approach widely used in
supervised learning [11, 24] but somewhat less in unsupervised learning [25].
Referred to as consensus clustering, this approach is often used to combine
partitions obtained with different algorithms, but it is less studied for
partitions produced by the same algorithm.

The paper is organized as follows. In Sect. 2, we review nonnegative matrix
factorization with the Frobenius norm and the Kullback-Leibler divergence.
Section 3 describes the ensemble method and the most commonly used algorithms.
In Sect. 4, we perform comparisons on document-term matrices and propose a
strategy to improve document clustering with NMF.


2 Nonnegative Matrix Factorization 


Nonnegative Matrix Factorization (NMF) [15], which aims to deliver a lower-rank
decomposition of a nonnegative data matrix X, has clustering properties for
which strong connections with K-means or spectral clustering can be drawn [16].
However, while several variants have been proposed to strengthen its clustering
property [10, 29-31], its original formulation does not involve a clustering
objective and was presented as a dimension reduction algorithm with exclusively
nonnegative factors. This is particularly useful in text mining, where NMF
produces a meaningful interpretation of document-term matrices, in comparison
with methods such as Singular Value Decomposition (SVD) or Latent Semantic
Analysis (LSA) [7], whose factors may contain negative values. NMF seeks to
approximate a matrix $X \in \mathbb{R}_+^{n \times d}$ by the product of two
lower-rank matrices $Z \in \mathbb{R}_+^{n \times g}$ and
$W \in \mathbb{R}_+^{d \times g}$ with $g(n + d) < nd$. This problem can be
formulated as a constrained optimization problem

$$F(Z, W) = \min_{Z \geq 0,\, W \geq 0} D(X, ZW^\top) \qquad (1)$$

where D is a fitting error measuring the quality of the approximation of X by
$ZW^\top$, the most popular choices being the Frobenius norm and the
Kullback-Leibler (KL) divergence. In a clustering setup, Z will be referred to
as the soft classification matrix and W as the centers matrix. Despite its many
applications, NMF has a recurrent weakness related to its initialization: it
provides a different solution for each initialization, which makes it
substantially sensitive to starting points, as its convergence directly depends
on the initial values. Several publications have looked for the best way to
start an NMF algorithm by providing a structured initialization, in some cases
obtained from the results of clustering algorithms such as k-means or Spherical
K-means [27, 28] (especially for applying NMF to document-term matrices),
Nonnegative Double Singular Value Decomposition (NNDSVD) [4] or SVD-based
strategies [17]. The optimization procedures for D equal to the Frobenius norm
and to the KL divergence, based on multiplicative update rules, are given in
Algorithms 1 and 2, respectively.


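Algorithm 1. (NMF-F).

Input: X, g, $Z^{(0)}$, $W^{(0)}$.
Output: Z and W.
repeat
  1. $Z \leftarrow Z \odot \frac{XW}{ZW^\top W}$
  2. $W \leftarrow W \odot \frac{X^\top Z}{WZ^\top Z}$
until convergence
3. Normalize Z so that it has unit-length column vectors.


Algorithm 2. (NMF-KL).

Input: X, g, $Z^{(0)}$, $W^{(0)}$.
Output: Z and W.
repeat
  1. $Z \leftarrow Z \odot \left(\frac{X}{ZW^\top} W\right) / \sum_j W_{jk}$
  2. $W \leftarrow W \odot \left(\frac{X^\top}{WZ^\top} Z\right) / \sum_i Z_{ik}$
until convergence
3. Normalize Z so that it has unit-length column vectors.

In both algorithms, $\odot$ and the fractions denote element-wise product and
division.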
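
As an illustration, the multiplicative updates of Algorithm 1 can be sketched
in a few lines of Python. This is a minimal sketch and not the implementation
used in the experiments: it assumes NumPy, a fixed number of iterations in
place of a convergence test, and a small constant eps (our addition) to avoid
divisions by zero. The function name nmf_frobenius and its parameters are
hypothetical.

```python
import numpy as np


def nmf_frobenius(X, g, n_iter=200, eps=1e-10, seed=0):
    """Sketch of NMF-F: minimize ||X - Z W^T||_F^2 subject to Z, W >= 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random nonnegative starting points, scaled by the mean of X.
    Z = rng.uniform(0.0, 1.0, size=(n, g)) * X.mean()
    W = rng.uniform(0.0, 1.0, size=(d, g)) * X.mean()
    errors = []
    for _ in range(n_iter):
        # Step 1: Z <- Z * (X W) / (Z W^T W), element-wise.
        Z *= (X @ W) / (Z @ (W.T @ W) + eps)
        # Step 2: W <- W * (X^T Z) / (W Z^T Z), element-wise.
        W *= (X.T @ Z) / (W @ (Z.T @ Z) + eps)
        errors.append(float(np.linalg.norm(X - Z @ W.T) ** 2))
    # Final step: unit-length columns for Z; W is rescaled so Z W^T is unchanged.
    norms = np.linalg.norm(Z, axis=0) + eps
    return Z / norms, W * norms, errors


# Toy nonnegative matrix (made-up data, for illustration only).
X = np.random.default_rng(42).random((20, 12))
Z, W, errors = nmf_frobenius(X, g=3)
labels = Z.argmax(axis=1)  # one common way to read a hard partition off Z
```

Running the function from several seeds and keeping the final value of errors
for each run gives the kind of criterion ranking studied in Sect. 4.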


3 Cluster Ensembles (CE) 


In machine learning, the idea of using multiple sources of data partitions
first appeared with multi-learner systems, where the outputs of several
classifiers were combined to improve the accuracy and robustness of a
classification or regression, and for which strong performance was reported
[24, 25]. At that stage, very few approaches had applied a similar concept to
unsupervised learning algorithms. In this respect, we note the work of [5], who
combined several clustering partitions through the combination of the cluster
centers. In the early 2000s, [25] were the first to consider combining several
data partitions without accessing any original source of information (features)
or computed centers. This approach is referred to as cluster ensembles. At the
time, their idea was motivated by the possibility of taking advantage of
existing information, such as prior clustering partitions or an expert
categorization (grouped under the term Knowledge Reuse), which may still be
relevant for a user to consider in a new analysis of the same objects, whether
or not the data associated with these objects differ from those used to define
the prior partitions. Another motivation was Distributed Computing, that is,
analyzing different sources of data stored in different locations, which might
be complicated to merge, for instance for privacy reasons. In our setting, we
use cluster ensembles to improve the quality of the final partition (as opposed
to selecting a single one) and thereby exploit the diversity offered by the
best solutions created by NMF.

In [25], the authors introduced three consensus methods that can produce a
partition. All of them consider the consensus problem on a hypergraph
representation H of the set of partitions $H^{q}$. More specifically, each
partition $H^{q}$ is encoded as a binary classification matrix (with objects in
rows and clusters in columns), and the concatenation of all these matrices
defines the hypergraph H.


- The first one is called Cluster-based Similarity Partitioning Algorithm
  (CSPA) and consists in performing a clustering on the hypergraph according
  to a similarity measure.

- The second is referred to as HyperGraph Partitioning Algorithm (HGPA)
  and aims at optimizing a minimum cut objective.

- The third one is called Meta-CLustering Algorithm (MCLA) and aims at
  identifying and constructing groups of clusters.
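
To make the hypergraph representation concrete, the sketch below builds H by
concatenating the binary membership matrices of r partitions and derives the
pairwise similarity on which CSPA operates, namely the fraction of partitions
in which two objects fall in the same cluster. It is a minimal pure-Python
illustration under our own assumptions (labels coded as 0..g-1, hypothetical
function names); the similarity-based clustering step itself is omitted.

```python
def hypergraph(partitions, g):
    """Concatenate the binary membership matrices of r partitions.

    partitions: list of r label vectors of length n, labels in 0..g-1.
    Returns H as a list of n rows with r * g binary entries.
    """
    n = len(partitions[0])
    H = []
    for i in range(n):
        row = []
        for labels in partitions:
            row.extend(1 if labels[i] == h else 0 for h in range(g))
        H.append(row)
    return H


def co_association(H, r):
    """S = (1 / r) H H^T: share of partitions where objects i and j co-cluster."""
    n = len(H)
    return [[sum(a * b for a, b in zip(H[i], H[j])) / r for j in range(n)]
            for i in range(n)]


# Three toy partitions of four objects (made-up labels).
partitions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
H = hypergraph(partitions, g=2)
S = co_association(H, r=len(partitions))
```

Note that the first two toy partitions are identical up to a relabeling, which
the similarity matrix S handles naturally since it only counts co-memberships.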


Furthermore, in [25] the authors proposed an objective function to characterize
the cluster ensembles problem, which allows the selection of the best consensus
algorithm among the three to deliver the ensemble partition. Let
$\Lambda = \{\lambda^{(q)} \mid q \in \{1, \ldots, r\}\}$ be a given set of r
partitions $\lambda^{(q)}$ represented as label vectors. The ensemble
criterion, denoted $\lambda^{(k\text{-opt})}$, is called the optimal combined
clustering and aims at maximizing the Average Normalized Mutual Information
(ANMI). It is defined as follows:

$$\lambda^{(k\text{-opt})} = \arg\max_{\hat{\lambda}} \sum_{q=1}^{r} \mathrm{NMI}(\hat{\lambda}, \lambda^{(q)}) \qquad (2)$$

The ANMI is simply the average of the normalized mutual information of a label
vector $\hat{\lambda}$ with all label vectors $\lambda^{(q)}$ in $\Lambda$:

$$\mathrm{ANMI}(\Lambda, \hat{\lambda}) = \frac{1}{r} \sum_{q=1}^{r} \mathrm{NMI}(\hat{\lambda}, \lambda^{(q)}) \qquad (3)$$
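
The criterion of Eqs. (2) and (3) can be sketched as follows. This is a
minimal illustration using only the Python standard library; the NMI is
normalized by the geometric mean of the two entropies, as in [25], and
returning 0 when one partition has a single cluster is a convention we assume
here. The function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information between two label vectors."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((nab / n) * log(n * nab / (ca[x] * cb[y]))
             for (x, y), nab in cab.items())
    denom = sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0  # assumed convention for denom == 0


def anmi(ensemble, candidate):
    """Average NMI of a candidate label vector with all vectors of the ensemble."""
    return sum(nmi(candidate, lam) for lam in ensemble) / len(ensemble)


# Toy ensemble of r = 3 label vectors (made-up labels); the candidates are
# here the ensemble members themselves.
ensemble = [[0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], [0, 0, 1, 2, 2, 2]]
best = max(ensemble, key=lambda lam: anmi(ensemble, lam))
```

In the setting of [25], the candidates would be the outputs of CSPA, HGPA and
MCLA, and the one with the highest ANMI would be retained.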


To deal with cases where the label vectors $\lambda^{(q)}$ have missing values,
the authors proposed a generalized expression of (2), not substantially
different, which readers can find in the original paper [25].


4 Experiments


We conduct several experiments to highlight the behavior of NMF on a clustering
task compared to a dedicated clustering algorithm, Spherical K-means, referred
to as S-Kmeans [9], which was introduced for clustering large sets of sparse
text data (or directional data) and remains appealing for its low computational
cost as well as its good performance. S-Kmeans is also used, alongside random
starting points (generated according to a uniform distribution
$\mathcal{U}(0, 1) \times \mathrm{mean}(X)$), as an initialization for NMF. We
use two error measures frequently employed for NMF: the Frobenius norm
(referred to as NMF-F) and the Kullback-Leibler divergence (NMF-KL). Finally,
we compute the consensus partition using the Cluster Ensemble Python package¹,
which implements the consensus methods defined earlier [25].


4.1 Datasets 


We apply NMF to 5 benchmark document-term matrices whose detailed
characteristics are given in Table 1, where nz indicates the percentage of
non-zero values and the balance coefficient is defined as the ratio of the
number of documents in the smallest class to the number of documents in the
largest class. These datasets cover a variety of challenging situations in
terms of number of clusters, dimensions, cluster balance, degree of mixture of
the different groups and sparsity. We normalized each data matrix with TF-IDF
and scaled the document vectors to unit L2-norm to remove the bias introduced
by their length.
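
The preprocessing and the two descriptive statistics above can be sketched as
follows. The exact TF-IDF variant is not specified in the text, so the
weighting tf x log(n / df) used here is an assumption, as are the function
names; the sketch relies only on the Python standard library and dense lists,
which would not scale to the real matrices.

```python
from collections import Counter
from math import log, sqrt


def tfidf_l2(counts):
    """TF-IDF weighting (assumed variant) followed by row-wise L2 normalization."""
    n, d = len(counts), len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(d)]
    out = []
    for row in counts:
        weighted = [tf * log(n / df[j]) if tf > 0 else 0.0
                    for j, tf in enumerate(row)]
        norm = sqrt(sum(v * v for v in weighted))
        out.append([v / norm for v in weighted] if norm > 0 else weighted)
    return out


def nz_percent(matrix):
    """Percentage of non-zero entries (the nz column of Table 1)."""
    total = sum(len(row) for row in matrix)
    return 100.0 * sum(1 for row in matrix for v in row if v != 0) / total


def balance(labels):
    """Size of the smallest class divided by the size of the largest class."""
    sizes = Counter(labels).values()
    return min(sizes) / max(sizes)


# Toy document-term counts and class labels (made-up values).
counts = [[2, 0, 1, 0], [0, 3, 1, 0], [1, 0, 1, 4]]
X = tfidf_l2(counts)
stats = (nz_percent(counts), balance([0, 0, 0, 1, 1, 2]))
```

With this weighting, a term present in every document receives a zero weight,
which is why the third column of the toy matrix vanishes after the transform.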


Table 1. Datasets description: # denotes the cardinality


Datasets | #Documents | #Words | #Clusters | nz (%) | Balance
CSTR     |        475 |   1000 |         4 |   3.40 | 0.399
CLASSIC4 |       7095 |   5896 |         4 |   0.59 | 0.323
RCV1     |       6387 |  16921 |         4 |   0.25 | 0.080
NG5      |       4905 |  10167 |         5 |   0.92 | 0.943
NG20     |      18846 |  14390 |        20 |   0.59 | 0.628


4.2 NMF Raw Performances and Initialization 


The results obtained by NMF-F and NMF-KL with S-Kmeans and random starting
points are given in Table 2. The clustering quality of the


1 https://pypi.org/project/Cluster_Ensembles/.


176 M. Febrissy and M. Nadif 


S-Kmeans partitions given as input to both algorithms is also displayed. We
use two measures to assess the clustering quality of each algorithm. The first
is the NMI [25], which quantifies how much information the clustering partition
shares with the true partition; the second is the ARI [14], which is sensitive
to the cluster proportions and measures the degree of agreement between the
clustering and the true partition. To replicate the experience of a user
carrying out an unsupervised task, we rely on the criterion of each algorithm
to select the 10 best solutions (out of 30 runs) and report their average NMI
and ARI with respect to the true partition.
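
The selection protocol described above can be sketched as follows. The run
records are made-up placeholders, not results from the paper, and the function
name best_runs and the record fields are hypothetical; the minimize flag is
our addition to cover criteria that are maximized rather than minimized.

```python
from statistics import mean, stdev


def best_runs(runs, k=10, minimize=True):
    """Return the k runs with the best criterion value."""
    return sorted(runs, key=lambda r: r["criterion"], reverse=not minimize)[:k]


# Made-up records standing in for 30 runs of one algorithm: the final
# criterion value of each run and its NMI against the true partition.
runs = [{"run": i, "criterion": 100.0 + (7 * i) % 30, "nmi": 0.50 + 0.01 * (i % 5)}
        for i in range(30)]

top10 = best_runs(runs, k=10)
summary = (mean(r["nmi"] for r in top10), stdev(r["nmi"] for r in top10))
```

The ranking uses the criterion only; the external index is computed afterwards,
which mirrors what a user without labels could actually do.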

One can clearly see that NMF-F and NMF-KL do not react similarly to the
different initializations. While NMF-F substantially benefits from the S-Kmeans
initialization on every dataset compared to the random initialization, NMF-KL
does not seem to take advantage of S-Kmeans starting values. In fact, S-Kmeans
starting values seem to worsen NMF-KL solutions, especially on CLASSIC4 and
NG5. For this reason, we avoid this initialization strategy for NMF-KL in what
follows, although it improves the results on RCV1. Also, NMF-KL with a random
initialization provides much better results than the other algorithms on almost
all datasets.


Table 2. Mean and standard deviation of NMI and ARI computed over the 10 best
solutions.


Datasets | Metrics | Skmeans       | NMF-F (Random) | NMF-F (Skmeans) | NMF-KL (Random) | NMF-KL (Skmeans)
CSTR     | NMI     | 0.76 ± 0.007  | 0.65 ± 0.002   | 0.73 ± 0.04     | 0.73 ± 0.03     | 0.76 ± 0.006
         | ARI     | 0.80 ± 0.007  | 0.55 ± 0.002   | 0.75 ± 0.10     | 0.77 ± 0.04     | 0.80 ± 0.006
CLASSIC4 | NMI     | 0.60 ± 0.001  | 0.53 ± 0.003   | 0.59 ± 0.002    | 0.71 ± 0.02     | 0.61 ± 0.03
         | ARI     | 0.47 ± 0.0009 | 0.45 ± 0.003   | 0.47 ± 0.002    | 0.65 ± 0.06     | 0.47 ± 0.004
RCV1     | NMI     | 0.38 ± 0.0003 | 0.35 ± 0.0005  | 0.38 ± 0.0002   | 0.47 ± 0.02     | 0.53 ± 0.002
         | ARI     | 0.18 ± 0.0004 | 0.13 ± 0.0008  | 0.18 ± 0.0003   | 0.42 ± 0.02     | 0.46 ± 0.02
NG5      | NMI     | 0.72 ± 0.02   | 0.56 ± 1.0e-05 | 0.72 ± 0.02     | 0.80 ± 0.03     | 0.79 ± 0.003
         | ARI     | 0.60 ± 0.01   | 0.33 ± 2.5e-05 | 0.60 ± 0.01     | 0.82 ± 0.04     | 0.76 ± 0.005
NG20     | NMI     | 0.49 ± 0.02   | 0.41 ± 0.01    | 0.49 ± 0.02     | 0.48 ± 0.02     | 0.51 ± 0.01
         | ARI     | 0.30 ± 0.02   | 0.23 ± 0.01    | 0.30 ± 0.02     | 0.34 ± 0.02     | 0.32 ± 0.02


We report in Figs. 1, 2, 3 and 4 the clustering quality of each algorithm’s
solutions, ranked from the best to the poorest in terms of criterion. The
criterion of each algorithm is normalized to lie in [0, 1].

A common practice for evaluating a clustering result against the real partition
is to rely on the best solution obtained by optimizing the objective function.
Figures 1 and 3 highlight a critical behavior of NMF-F: the solutions with the
lowest minima do not yield the best clustering partitions, sometimes with a
substantial gap (see CSTR, RCV1, NG5 in Fig. 1). Moreover, a similar, though
surprisingly weaker, behavior is observed with S-Kmeans which, unlike NMF,
optimizes a clustering objective by definition. The results are displayed in
Fig. 2. In fact, this behavior can be observed with several types of clustering
algorithms that rely on an optimization procedure. Initializing NMF-F randomly,
as shown in Fig. 3, seems to lessen this effect (on CSTR, CLASSIC4 and RCV1).
On the other hand, NMF-KL, which to this day remains recognized as a relevant
method for document clustering [13], seems to consistently deliver solutions
whose lowest criteria are aligned with the quality of their clustering,
supporting the use of NMF for clustering purposes. Furthermore, NMF-KL is the
only method producing a wide variety of solutions and therefore seems to
explore far more possibilities than NMF-F or S-Kmeans. Its better behavior
might almost support the idea of keeping the best partition in terms of
criterion. However, it still fails on RCV1, which is the hardest dataset to
partition, mainly because of its low density. Finally, it remains questionable
to select the best partition on this basis alone since, even with NMF-KL, the
solution providing the best clustering among the best ones is not necessarily
the first one (see CSTR, CLASSIC4 and NG5).

Fig. 1. NMF-F: NMI/ARI behaviour according to the objective function F
(initializations by S-Kmeans)

In addition, while the best solutions may share a similar amount of information
with the true partition, they can be fairly distinct from each other, which
makes them appealing for deriving an even more complete solution. Figure 5
shows the pairwise NMI and ARI between the top 10 partitions (criterion-wise)
of each algorithm. The best solutions of NMF-KL appear to be fairly different
from each other.

Fig. 2. S-Kmeans: NMI/ARI behaviour according to the objective function F
(Random initializations)

Fig. 3. NMF-F: NMI/ARI behaviour according to the objective function F (Random
initializations)

Fig. 4. NMF-KL: NMI/ARI behaviour according to the objective function F (Random
initializations)

Fig. 5. Average pairwise NMI & ARI between top 10 solutions


4.3 Consensus Clustering 


Following the previous observation, we computed a cluster ensemble (CE) for
NMF-F and NMF-KL according to their best initialization strategy, as well as
for S-Kmeans, given its relevance for initializing NMF-F and its wide
recognition as a suitable method for document clustering. The results are
reported in Table 3. It appears that the consensus obtained with the top 10
results of each method generally outperforms the best solution. This result is
even stronger for NMF-KL, where the ensemble clustering increases the NMI and
ARI by 11 and 13 points, respectively, on NG20. Note that NG20 is the dataset
where the average pairwise NMI and ARI between the top 10 partitions are the
lowest, meaning that these partitions are the most different (see Fig. 5).
Furthermore, it is interesting to note that these performances are obtained
from solutions whose average NMI and ARI are smaller than those of the best
solution itself.


Table 3. Mean and standard deviation, first best result and CE consensus computed
over the 10 best solutions.


         |         | NMF-F (Skmeans)                 | Skmeans                         | NMF-KL (Random)
Datasets | Metrics | Mean ± SD     | (best) | CE     | Mean ± SD     | (best) | CE     | Mean ± SD   | (best) | CE
CSTR     | NMI     | 0.73 ± 0.04   | (0.65) | (0.76) | 0.76 ± 0.007  | (0.77) | (0.77) | 0.73 ± 0.03 | (0.76) | (0.80)
         | ARI     | 0.75 ± 0.10   | (0.56) | (0.80) | 0.80 ± 0.007  | (0.80) | (0.80) | 0.77 ± 0.04 | (0.81) | (0.83)
CLASSIC4 | NMI     | 0.59 ± 0.002  | (0.59) | (0.59) | 0.60 ± 0.001  | (0.59) | (0.60) | 0.71 ± 0.02 | (0.72) | (0.74)
         | ARI     | 0.47 ± 0.002  | (0.47) | (0.47) | 0.47 ± 0.0009 | (0.47) | (0.47) | 0.65 ± 0.06 | (0.65) | (0.72)
RCV1     | NMI     | 0.38 ± 0.0002 | (0.38) | (0.35) | 0.38 ± 0.0003 | (0.38) | (0.35) | 0.47 ± 0.02 | (0.47) | (0.52)
         | ARI     | 0.18 ± 0.0003 | (0.18) | (0.26) | 0.18 ± 0.0004 | (0.18) | (0.26) | 0.42 ± 0.02 | (0.43) | (0.46)
NG5      | NMI     | 0.72 ± 0.02   | (0.74) | (0.76) | 0.72 ± 0.02   | (0.73) | (0.75) | 0.80 ± 0.03 | (0.83) | (0.86)
         | ARI     | 0.60 ± 0.01   | (0.61) | (0.60) | 0.60 ± 0.01   | (0.60) | (0.64) | 0.82 ± 0.04 | (0.85) | (0.88)
NG20     | NMI     | 0.49 ± 0.02   | (0.51) | (0.50) | 0.49 ± 0.02   | (0.51) | (0.50) | 0.48 ± 0.02 | (0.50) | (0.61)
         | ARI     | 0.30 ± 0.02   | (0.32) | (0.34) | 0.30 ± 0.02   | (0.32) | (0.34) | 0.34 ± 0.02 | (0.36) | (0.49)


4.4 Consensus Multinomial 


Following the cluster-based consensus approach, which relies on a
similarity-based clustering algorithm, we decided to use model-based clustering
to try to obtain a better final partition than the one delivered by cluster
ensembles. In [26], the authors used the Multinomial mixture approach to
propose a consensus function. In model-based clustering, it is assumed that the
data are generated by a mixture of underlying probability distributions, where
each component k of the mixture represents a cluster.

Let $\Lambda \in \mathbb{N}^{n \times r}$ be the data matrix of label vectors
from the top r solutions. Our data being categorical, we use a Multinomial
Mixture Model (MMM) to partition the elements $\lambda_i$. Categorical data
being a generalization of binary data, and assuming a perfect scenario where no
partition has an empty cluster, a disjunctive matrix
$M \in \{0, 1\}^{n \times rg}$ is usually used instead of $\Lambda$, with
values $m_{iq}^{(h)}$, where $h \in \{1, \ldots, g\}$ is a cluster label.
Therefore, the data values $m_i$ are assumed to be generated from a Multinomial
distribution $\mathcal{M}(m_{iq}^{(h)}; \alpha_{kq}^{(h)})$, where
$\alpha_{kq}^{(h)}$ is the probability that an element $m_i$ in the group k
takes the category h for the partition/variable $\lambda_q$. The density
probability function of the model can be stated as:

$$f(M; \theta) = \prod_{i} \sum_{k} \pi_k \prod_{q, h} \left(\alpha_{kq}^{(h)}\right)^{m_{iq}^{(h)}} \qquad (4)$$

where $\theta = (\pi, \alpha)$ are the parameters of the model, with
$\pi = (\pi_1, \ldots, \pi_g)$ being the proportions and $\alpha$ the vector of
the component parameters.


Table 4. MMM consensus results over the 10 best solutions


         |         | NMF-KL (Random)
Datasets | Metrics | Mean ± SD   | (best) | CE     | MMM
CSTR     | NMI     | 0.73 ± 0.03 | (0.76) | (0.80) | (0.77)
         | ARI     | 0.77 ± 0.04 | (0.81) | (0.83) | (0.82)
CLASSIC4 | NMI     | 0.71 ± 0.02 | (0.72) | (0.74) | (0.77)
         | ARI     | 0.65 ± 0.06 | (0.65) | (0.72) | (0.75)
RCV1     | NMI     | 0.47 ± 0.02 | (0.47) | (0.52) | (0.52)
         | ARI     | 0.42 ± 0.02 | (0.43) | (0.46) | (0.46)
NG5      | NMI     | 0.80 ± 0.03 | (0.83) | (0.86) | (0.86)
         | ARI     | 0.82 ± 0.04 | (0.85) | (0.88) | (0.89)
NG20     | NMI     | 0.48 ± 0.02 | (0.50) | (0.61) | (0.63)
         | ARI     | 0.34 ± 0.02 | (0.36) | (0.49) | (0.50)
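
To illustrate the multinomial mixture density of Eq. (4), the sketch below
builds the disjunctive coding of a small label matrix and evaluates the
log-likelihood for given parameters. It is a minimal illustration under our own
assumptions: labels are coded as 0..g-1, M is stored as a nested n x r x g
list rather than a flat n x rg matrix, and the parameter values are made up.
Parameter estimation (for instance by EM, as performed by the package used in
the experiments) is not shown, and the function names are hypothetical.

```python
from math import log


def disjunctive(Lambda, g):
    """One-hot code an n x r matrix of labels in 0..g-1 as an n x r x g list."""
    return [[[1 if lab == h else 0 for h in range(g)] for lab in row]
            for row in Lambda]


def log_likelihood(M, pi, alpha):
    """log f(M; theta) of Eq. (4).

    pi[k] is the proportion of component k; alpha[k][q][h] is the probability
    that an element of component k takes category h for partition q.
    """
    total = 0.0
    for m_i in M:
        mixture = 0.0
        for k, pi_k in enumerate(pi):
            prod = pi_k
            for q, m_iq in enumerate(m_i):
                for h, m in enumerate(m_iq):
                    if m:  # exponent m_iq^(h) is 1 only for the observed label
                        prod *= alpha[k][q][h]
            mixture += prod
        total += log(mixture)
    return total


# Toy example: n = 4 objects, r = 2 partitions, g = 2 clusters (made-up values).
Lambda = [[0, 0], [0, 0], [1, 1], [1, 0]]
M = disjunctive(Lambda, g=2)
pi = [0.5, 0.5]
alpha = [[[0.9, 0.1], [0.9, 0.1]],
         [[0.1, 0.9], [0.2, 0.8]]]
ll = log_likelihood(M, pi, alpha)
```

Constraining pi to be equal, or alpha to be shared across the partitions q,
corresponds to the kind of parsimonious variants compared by BIC.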


The Rmixmod package² is used to carry out our analysis. We employ the default
settings to compute the clustering, allowing the selection among 10
parsimonious models according to the Bayesian Information Criterion (BIC) [23].
With CSTR, the model mainly selected is the one keeping the proportions $\pi_k$
free, with parameters independent of the variables (label vectors), meaning
$\mathcal{M}(m_{iq}^{(h)}; \alpha_{k}^{(h)})$. CSTR is the dataset with the
highest pairwise NMI and ARI, and therefore with the most similar best
solutions. On CLASSIC4 and RCV1, where the pairwise NMI and ARI are slightly
lower, the model mainly chosen is the one with free proportions and parameters
$\alpha$ depending on both components and label vectors,
$\mathcal{M}(m_{iq}^{(h)}; \alpha_{kq}^{(h)})$. On NG5, where the best
solutions are fairly similar (high pairwise NMI and ARI), the model retained is
the one depending on the components and the label vectors. However, the
proportions here were kept equal. For NG20, where the best solutions were
fairly distinct, the model selected is the one depending on the components and
the variables. As previously, the proportions $\pi_k$ are kept equal. Given the
characteristics in Table 1, it is notable that the datasets where the
proportions are kept equal are those with the most balanced real cluster
proportions. The consensus results are displayed in Table 4, which only retains
the prior results of the NMF-KL top 10 solutions and the CE consensus, as they
were the best overall. Apart from CSTR, we can see that MMM matches or improves
on CE when computing a partition from the top 10 solutions.


5 Conclusion 


In this paper, by using cluster ensembles, we have proposed a simple method to
obtain a better clustering with NMF algorithms on text data. By its aggregating
nature, this process should also reduce the uncertainty about the overall
quality of the final partition compared to other selection practices, such as
keeping a single solution according to the best criterion. Furthermore, we have
shown that it is possible to improve the consensus quality through the use of
finite mixture models, which allow more powerful underlying settings than
cluster-based consensus involving plain similarities or distances. Future work
will investigate the use of cluster ensembles for other recent clustering
algorithms [1-3, 19, 20].


2 https://cran.r-project.org/web/packages/Rmixmod/Rmixmod.pdf.


Abstract. Methods that learn the structure of Probabilistic Sentential
Decision Diagrams (PSDD) from data have achieved state-of-the-art
performance in tractable learning tasks. These methods learn PSDDs
incrementally by optimizing the likelihood of the induced probability
distribution given available data and are thus robust against missing
values, a relevant trait to address the challenges of embedded applications,
such as failing sensors and resource constraints. However, PSDDs are
outperformed by discriminatively trained models in classification tasks. In
this work, we introduce D-LEARNPSDD, a learner that improves the
classification performance of the LEARNPSDD algorithm by introducing
a discriminative bias that encodes the conditional relation between the
class and feature variables.


Keywords: Probabilistic models · Tractable inference · PSDD


1 Introduction 


Probabilistic machine learning models have been shown to be well suited to the
challenges inherent to embedded applications, such as the need to handle
uncertainty and missing data [11]. Moreover, current efforts in the field of
Tractable Probabilistic Modeling have been making great strides towards
successfully balancing the trade-offs between model performance and inference
efficiency: probabilistic circuits, such as Probabilistic Sentential Decision
Diagrams (PSDDs), Sum-Product Networks (SPNs), Arithmetic Circuits (ACs) and
Cutset Networks, possess myriad desirable properties [4] that make them
amenable to application scenarios where strict resource budget constraints must
be met [12]. But these models’ robustness against missing data, which comes
from learning them generatively, is often at odds with their discriminative
capabilities. We address this conflict by proposing a discriminative-generative
probabilistic circuit learning strategy, which aims to improve the models’
discriminative capabilities while maintaining their robustness against missing
features.

© The Author(s) 2020

We focus in particular on the PSDD [17], a state-of-the-art tractable rep- 
resentation that encodes a joint probability distribution over a set of random 
variables. Previous work [12] has shown how to learn hardware-efficient PSDDs 
that remain robust to missing data and noise. This approach relies largely on the 
LEARNPSDD algorithm [20], a generative algorithm that incrementally learns 
the structure of a PSDD from data. Moreover, it has been shown how to exploit 
such robustness to trade off resource usage with accuracy. And while the achieved 
accuracy is competitive when compared to Bayesian Network classifiers, dis- 
criminatively learned models perform consistently better than purely generative 
models [21] since the latter remain agnostic to the discriminative task they ought 
to perform. This raises the question of whether the discriminative performance
of the PSDD could be improved while remaining robust and tractable.

In this work, we propose a hybrid discriminative-generative PSDD learning 
strategy, D-LEARNPSDD, that enforces the discriminative relationship between 
class and feature variables by capitalizing on the model’s ability to encode 
domain knowledge as a logic formula. We show that this approach consistently 
outperforms the purely generative PSDD and is competitive compared to other 
classifiers, while remaining robust to missing values at test time. 


2 Background 


Notation. Variables are denoted by upper case letters $X$ and their
instantiations by lower case letters $x$. Sets of variables are denoted in bold
upper case $\mathbf{X}$ and their joint instantiations in bold lower case
$\mathbf{x}$. For the classification task, the feature set is denoted by
$\mathbf{F}$ while the class variable is denoted by $C$.


[Figure 1: (a) Bayes net over Rain, Sun and Rbow; (b) conditional
probabilities: Pr(Rain) = 0.2; Pr(Sun | Rain) = 0.1 if Rain, 0.7 if ¬Rain;
Pr(Rbow | R, S) = 1 if Rain ∧ Sun, 0 otherwise; (c) equivalent PSDD circuit;
(d) PSDD’s vtree]

Fig. 1. A Bayesian network and its equivalent PSDD (taken from [20]).


PSDD. Probabilistic Sentential Decision Diagrams (PSDDs) are circuit
representations of joint probability distributions over binary random variables
[17]. They were introduced as probabilistic extensions to Sentential Decision
Diagrams (SDDs) [7], which represent Boolean functions as logical circuits. The
inner nodes of a PSDD alternate between AND gates with two inputs and OR gates
with an arbitrary number of inputs; the root must be an OR node; and each leaf
node encodes a distribution over a variable X (see Fig. 1c). The combination of
an OR gate with its AND gate inputs is referred to as a decision node, where
the left input of the AND gate is called prime (p) and the right is called sub
(s). The n edges of a decision node are annotated with a normalized probability
distribution $\theta_1, \ldots, \theta_n$.

PSDDs possess two important syntactic restrictions: (1) Each AND node must be
decomposable, meaning that its input variables must be disjoint. This property
is enforced by a vtree, a binary tree whose leaves are the random variables and
which determines how variables are arranged in primes and subs in the PSDD (see
Fig. 1d): each internal vtree node is associated with the PSDD nodes at the
same level; variables appearing in its left subtree $\mathbf{X}$ are the primes
and those appearing in its right subtree $\mathbf{Y}$ are the subs. (2) Each
decision node must be deterministic, thus only one of its inputs can be true.

Each PSDD node q represents a probability distribution. Terminal nodes encode
univariate distributions. Decision nodes, when normalized for a vtree node with
$\mathbf{X}$ in its left subtree and $\mathbf{Y}$ in its right subtree, encode
the following distribution over $\mathbf{XY}$ (see also Fig. 1a and c):


$$\Pr_q(\mathbf{XY}) = \sum_i \theta_i \Pr_{p_i}(\mathbf{X}) \Pr_{s_i}(\mathbf{Y}) \qquad (1)$$


Thus, each decision node decomposes the distribution into independent
distributions over $\mathbf{X}$ and $\mathbf{Y}$. In general, prime and sub
variables are independent at PSDD node q given the prime base $[q]$ [17]. This
base is the support of the node’s distribution, over which it defines a
non-zero probability, and it is written as a logical sentence using the
recursion $[q] = \bigvee_i [p_i] \wedge [s_i]$. Kisa et al. [17] show that
prime and sub variables are independent in PSDD node q given a prime base:


$$\Pr_q(\mathbf{XY} \mid [p_i]) = \Pr_{p_i}(\mathbf{X} \mid [p_i]) \Pr_{s_i}(\mathbf{Y} \mid [p_i]) = \Pr_{p_i}(\mathbf{X}) \Pr_{s_i}(\mathbf{Y}) \qquad (2)$$
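
The decomposition encoded by a decision node can be sketched as follows. This
is a minimal illustration of Eq. (1) for a single node with one prime variable
and one sub variable, not a PSDD library: the helper names are hypothetical and
the parameter values are illustrative, loosely following the Rain/Sun example
of Fig. 1.

```python
def literal(value):
    """Terminal node for a literal: all probability mass on one truth value."""
    return lambda v: 1.0 if v == value else 0.0


def bernoulli(p_true):
    """Terminal node encoding a univariate distribution over a binary variable."""
    return lambda v: p_true if v else 1.0 - p_true


def decision_node(elements):
    """elements: list of (theta_i, prime_i, sub_i) triples, as in Eq. (1)."""
    def pr(x, y):
        return sum(theta * prime(x) * sub(y) for theta, prime, sub in elements)
    return pr


# Primes [Rain] and [not Rain] are mutually exclusive and exhaustive, so the
# node is deterministic; the thetas sum to one.
node = decision_node([
    (0.2, literal(True), bernoulli(0.1)),   # Rain, then Pr(Sun) = 0.1
    (0.8, literal(False), bernoulli(0.7)),  # not Rain, then Pr(Sun) = 0.7
])
table = {(x, y): node(x, y) for x in (True, False) for y in (True, False)}
```

Because exactly one prime is satisfied by any instantiation of the prime
variable, only one term of the sum is non-zero for each query.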


This equation encodes context specific independence [2], where variables (or sets 
of variables) are independent given a logical sentence. The structural constraints 
of the PSDD are meant to exploit such independencies, leading to a represen- 
tation that can answer a number of complex queries in polynomial time [1], 
which is not guaranteed when performing inference on Bayesian Networks, as 
they don’t encode and therefore can’t exploit such local structures. 


LearnPSDD. The LEARNPSDD algorithm [20] generatively learns a PSDD by
maximizing log-likelihood given available data. The algorithm starts by
learning a vtree that minimizes the mutual information among all possible sets
of variables. This vtree is then used to guide the PSDD structure learning
stage, which relies on the iterative application of the Split and Clone
operations [20]. These operations keep the PSDD syntactically sound while
improving the likelihood of the distribution represented by the PSDD. A problem
with using the resulting model for classification is that, when the class
variable is only weakly dependent on the features, the learner may choose to
ignore that dependency, potentially rendering the model unfit for
classification tasks.


3 A Discriminative Bias for PSDD Learning 


Generative learners such as LEARNPSDD optimize the likelihood of the distribu- 
tion given available data rather than the conditional likelihood of the class vari- 
able C given a full set of feature variables F. As a result, their accuracy is often 
worse than that of simple models such as Naive Bayes (NB), and its close relative 
Tree Augmented Naive Bayes (TANB) [12], which perform surprisingly well on 
classification tasks even though they encode a simple (or naive) structure [10].
One of the main reasons for their performance, despite being generative, is that 
(TA)NB models have a discriminative bias that directly encodes the conditional 
dependence of all the features on the class variable. 

We introduce D-LEARNPSDD, an extension to LEARNPSDD based on the 
insight that the learned model should satisfy the “class conditional constraint” 
present in Bayesian Network classifiers. That is, all feature variables must be 
conditioned on the class variable. This enforces a structure that is beneficial
for classification while still allowing the generative learning of a PSDD that
encodes the distribution over all variables using a state-of-the-art learning
strategy [20].


3.1 Discriminative Bias


The classification task can be stated as a probabilistic query:

$$\Pr(C \mid F) \propto \Pr(F \mid C) \cdot \Pr(C). \qquad (3)$$


Our goal is to learn a PSDD whose root decision node directly represents the
conditional probability distribution Pr(F|C). This can be achieved by forcing
the primes of the first line in Eq. 2 to be [p_0] = [¬c] and [p_1] = [c], where [c]
states that the propositional variable c representing the class variable is true
(i.e. C = 1), and similarly [¬c] represents C = 0. For now we assume the class is
binary and will show later how to generalize to a multi-valued class variable. For
the feature variables we can assume they are binary without loss of generality
since a multi-valued variable can be converted to a set of binary variables via a
one-hot encoding (see, for example, [20]). To achieve our goal we first need the
following proposition:


Proposition 1. Given (i) a vtree with a single variable C as the prime and
variables F as the sub of the root node, and (ii) an initial PSDD where the
root decision node decomposes the distribution as [root] = ([p_0] ∧ [s_0]) ∨ ([p_1] ∧
[s_1]); applying the Split and Clone operators will never change the root decision
decomposition [root] = ([p_0] ∧ [s_0]) ∨ ([p_1] ∧ [s_1]).


Proof. The D-LEARNPSDD algorithm iteratively applies two operations: Clone
and Split (following the algorithm in [20]). First, the Clone operator requires a
parent node, which is not available for the root node. Since the initial PSDD
follows the logical formula described above, whose only restriction is on the root
node, there is no parent available to clone and the root's base thus remains intact
when applying the Clone operator. Second, the Split operator splits one of the
subs to extend the sentence that is used to mutually exclusively and exhaustively
define all children. Since the given vtree has only one variable, C, as the prime
of the root node, there are no other variables available to add to the sub. The
Split operator can thus no longer be applied and the root's base stays intact
(see Figs. 1c and d).


We can now show that the resulting PSDD contains nodes that directly 
represent the distribution Pr(F|C). 


Proposition 2. A PSDD of the form [root] = ([¬c] ∧ [s_0]) ∨ ([c] ∧ [s_1]) with c
the propositional variable stating that the class variable is true, and s_0 and s_1
any formula with propositional feature variables f_0, ..., f_n, directly expresses the
distribution Pr(F|C).


Proof. Applying this to Eq. 1 results in:

$$\begin{aligned}
\Pr_q(CF) &= \Pr_{\neg c}(C)\,\Pr_{s_0}(F) + \Pr_{c}(C)\,\Pr_{s_1}(F) \\
          &= \Pr_{\neg c}(C \mid [\neg c]) \cdot \Pr_{s_0}(F \mid [\neg c]) + \Pr_{c}(C \mid [c]) \cdot \Pr_{s_1}(F \mid [c]) \\
          &= \Pr_{\neg c}(C = 0) \cdot \Pr_{s_0}(F \mid C = 0) + \Pr_{c}(C = 1) \cdot \Pr_{s_1}(F \mid C = 1)
\end{aligned}$$

The learned PSDD thus contains a node s_0 with distribution Pr_{s_0} that
directly represents Pr(F|C = 0) and a node s_1 with distribution Pr_{s_1} that
represents Pr(F|C = 1). The PSDD thus encodes Pr(F|C) directly because the two
possible value assignments of C are C = 0 and C = 1.


The following examples illustrate why both the specific vtree and initial 
PSDD are required. 


Example 1. Figure 2b shows a PSDD that encodes a fully factorized probability 
distribution normalized for the vtree in Fig. 2a. The PSDD shown in this example 
initializes the incremental learning procedure of LEARNPSDD [20]. Note that 
the vtree does not connect the class variable C to all feature variables (e.g.
F). Therefore, when initializing the algorithm on this vtree-PSDD combination, 
there are no guarantees that the conditional relations between certain features 
and the class will be learned. 


Example 2. Figure 2e shows a PSDD that explicitly conditions the feature
variables on the class variable by normalizing for the vtree in Fig. 2c and by
following the logical formula from Proposition 2. This biased PSDD is then used to
initialize the D-LEARNPSDD learner. Note that the vtree in Fig. 2c forces the
prime of the root node to be the class variable C.


Example 3. Figure 2d shows, however, that only setting the vtree in Fig. 2c is
not sufficient for the learner to condition the features on the class. When
initializing on a PSDD that encodes a fully factorized formula, and then applying the
Split and Clone operators, the relationship between the class variable and the
features is not guaranteed to be learned. In this worst-case scenario, the learned
model could perform even worse than in the case from Example 1. Applying
Eq. 1 to the top split gives an intuition for why this is the case:


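$$\begin{aligned}
\Pr_q(CF) &= \Pr_{p_0}(C \mid [c \vee \neg c]) \cdot \Pr_{s_0}(F \mid [c \vee \neg c]) \\
          &= \bigl(\Pr_{p_0}(C \mid [c]) + \Pr_{p_0}(C \mid [\neg c])\bigr) \cdot \Pr_{s_0}(F \mid [c \vee \neg c]) \\
          &= \bigl(\Pr_{p_0}(C = 1) + \Pr_{p_0}(C = 0)\bigr) \cdot \Pr_{s_0}(F)
\end{aligned}$$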


The PSDD thus encodes a distribution that assumes that the class variable is 
independent from all feature variables. While this model might still have a high 
likelihood, its classification accuracy will be low. 
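
The consequence of Example 3 can be checked numerically with a small sketch. It does not use a PSDD: it builds a joint table that factorizes as Pr(C) · Pr(F), with made-up numbers and hypothetical names (prior, feature_dist, posterior_positive), and shows that the class posterior then equals the prior for every feature assignment, so the features carry no information for classification.

```python
from itertools import product

# Hypothetical numbers: a class prior and a distribution over two binary features.
prior = {0: 0.7, 1: 0.3}
feature_dist = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Joint that factorizes as Pr(C) * Pr(F), as in the last line of the derivation.
joint = {(c, f): prior[c] * feature_dist[f] for c, f in product(prior, feature_dist)}


def posterior_positive(joint, f):
    """Pr(C = 1 | F = f), obtained by normalizing the joint over both class values."""
    return joint[(1, f)] / (joint[(0, f)] + joint[(1, f)])


posteriors = {f: posterior_positive(joint, f) for f in feature_dist}
for f, p in posteriors.items():
    # Every assignment yields the prior Pr(C = 1), up to rounding.
    print(f, round(p, 6))
```

Such a model always predicts the majority class, regardless of the observed features.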


We have so far introduced D-LEARNPSDD for a binary classification
task. However, it can be easily generalized to an n-valued classification scenario:
(1) The class variable C will be represented by multiple propositional variables
c_0, c_1, ..., c_n that represent the set C = 0, C = 1, ..., C = n, of which exactly
one will be true at all times. (2) The vtree in Proposition 1 now starts as a
right-linear tree over c_0, ..., c_n. The F variables are the sub of the node that
has c_n as prime. (3) The initial PSDD in Proposition 2 now has a root of the
form [root] = ∨_{i=0..n} ([c_i ∧_{j=0..n, j≠i} ¬c_j] ∧ [s_i]), which remains the same after
applying Split and Clone. The root decision node now represents the distribution
Pr_q(CF) = Σ_{i=0..n} Pr_{c_i ∧_{j≠i} ¬c_j}(C = i) · Pr_{s_i}(F|C = i) and therefore has nodes
at the top of the tree that directly represent the discriminative bias.


3.2 Generative Bias 


Learning the distribution over the feature variables is a generative learning pro- 
cess and we can achieve this by applying the Split and Clone operators in the 
same way as the original LEARNPSDD algorithm. In the previous section we had
not yet defined how Pr(F|C) from Proposition 2 should be represented in the
initial PSDD; we only explained how our constraint enforces it. So the question is
how exactly we define the nodes corresponding to s_0 and s_1 with distributions
Pr(F|C = 0) and Pr(F|C = 1). We follow the intuition behind (TA)NB
and start with a PSDD that encodes a distribution where all feature variables 
are independent given the class variable (see Fig. 2e). Next, the LEARNPSDD 
algorithm will incrementally learn the relations between the feature variables by 
applying the Split and Clone operations following the approach in [20]. 
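
The starting point described above, features independent given the class, combined with the query Pr(C|F) ∝ Pr(F|C) · Pr(C), can be illustrated with a minimal sketch. It operates on plain probability tables rather than on a PSDD; the function names, the Laplace smoothing constant alpha and the toy data are assumptions made for illustration only.

```python
import math


def fit_class_conditional(X, y, n_classes=2, alpha=1.0):
    """Estimate Pr(C) and Pr(F_j = 1 | C) from binary data.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    n_features = len(X[0])
    counts = [0] * n_classes
    ones = [[0] * n_features for _ in range(n_classes)]
    for row, c in zip(X, y):
        counts[c] += 1
        for j, v in enumerate(row):
            ones[c][j] += v
    prior = [(counts[c] + alpha) / (len(y) + alpha * n_classes)
             for c in range(n_classes)]
    cond = [[(ones[c][j] + alpha) / (counts[c] + 2 * alpha)
             for j in range(n_features)]
            for c in range(n_classes)]
    return prior, cond


def log_joint(row, c, prior, cond):
    """log Pr(C = c) + sum_j log Pr(F_j = row[j] | C = c)."""
    total = math.log(prior[c])
    for j, v in enumerate(row):
        p = cond[c][j]
        total += math.log(p if v == 1 else 1.0 - p)
    return total


def predict(row, prior, cond):
    """Return the class maximizing Pr(F | C) * Pr(C)."""
    return max(range(len(prior)), key=lambda c: log_joint(row, c, prior, cond))


# Made-up toy data in which the first feature coincides with the class.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
prior, cond = fit_class_conditional(X, y)
print(predict([1, 0], prior, cond), predict([0, 1], prior, cond))
```

In D-LEARNPSDD the structure learner then refines the two class-conditional distributions; the sketch only covers the initial, fully conditionally independent model.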


3.3 Obtaining the Vtree 


In LEARNPSDD, the decision nodes decompose the distribution into independent 
distributions. Thus, the vtree is learned from data by maximizing the approxi- 
mate pairwise mutual information, as this metric quantifies the level of indepen- 
dence between two sets of variables. For D-LEARNPSDD we are interested in 
the level of conditional independence between sets of feature variables given the
class variable. We thus obtain the vtree by optimizing for Conditional Mutual 
Information instead and replace mutual information in the approach in [20] with: 


$$\mathrm{CMI}(X, Y \mid Z) = \sum_{x} \sum_{y} \sum_{z} \Pr(xyz) \log \frac{\Pr(z)\,\Pr(xyz)}{\Pr(xz)\,\Pr(yz)}$$
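
As an illustration, conditional mutual information can be estimated from data with empirical frequencies. The sketch below uses the standard definition, the sum over x, y, z of Pr(xyz) log [Pr(z) Pr(xyz) / (Pr(xz) Pr(yz))], for single discrete variables; the function name and the use of natural logarithms are our choices, and the approximation over sets of variables used for vtree learning in [20] is not reproduced here.

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, zs):
    """Empirical CMI(X, Y | Z) in nats from three aligned sequences of discrete values."""
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in n_xyz.items():
        # Pr(z) Pr(xyz) / (Pr(xz) Pr(yz)); the 1/n factors cancel out.
        ratio = (n_z[z] * c) / (n_xz[(x, z)] * n_yz[(y, z)])
        cmi += (c / n) * math.log(ratio)
    return cmi


# X and Y are copies of Z: given Z nothing is left to share, so the CMI is 0.
cmi_redundant = conditional_mutual_information([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1])
# X equals Y and both are independent of Z: the CMI is the entropy of X, log 2.
cmi_dependent = conditional_mutual_information([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1])
print(cmi_redundant, cmi_dependent)
```

With Z playing the role of the class variable, a low value indicates that two features are close to independent once the class is known.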


Fig. 2. Examples of vtrees and initial PSDDs.


4 Experiments 


We compare the performance of D-LEARNPSDD, LEARNPSDD, two generative
Bayesian classifiers (NB and TANB) and a discriminative classifier (logistic
regression). In particular, we discuss the following research questions:
(1) Sect. 4.2 examines whether the introduced discriminative bias improves
classification performance on PSDDs. (2) Sect. 4.3 analyzes the impact of the
vtree and the imposed structural constraints on model tractability and
performance. (3) Finally, Sect. 4.4 compares the robustness to missing values
for all classification approaches.




Table 1. Datasets


Dataset      |F|  |C|   |N|
Australian    40    2   690
Breast        28    2   683
Chess         39    2  3196
Cleve         25    2   303
Corral         6    2   160
Credit        42    2   653
Diabetes      11    2   768
German        54    2  1000
Glass         17    6   214
Heart          9    2   270
Iris          12    3   150
Mofn          10    2  1324
Pima          11    2   768
Vehicle       57    2   846
Waveform     109    3  5000


4.1 Setup


We ran our experiments on the suite of 15 standard machine learning bench- 
marks listed in Table 1. All of the datasets come from the UCI machine learning 
repository [8], with the exception of “Mofn” and “Corral” [18]. As pre-processing
steps, we applied the discretization method described in [9], and we binarized all 
variables using a one-hot encoding. Moreover, we removed instances with miss- 
ing values and features whose value was always equal to 0. Table 1 summarizes 
the number of binary features |F|, the number of classes |C| and the available 
number of training samples |N| per dataset. 
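
The binarization and filtering steps can be sketched as follows. This is an illustration under assumptions rather than the pipeline actually used: the discretization method of [9] is not shown, missing values are assumed to be marked by None, each column's value domain is assumed to be given, and the function name preprocess and the toy rows are hypothetical.

```python
def preprocess(rows, domains):
    """One-hot encode discrete columns, dropping incomplete rows and all-zero features.

    rows    -- list of tuples of discrete values, None marking a missing value
    domains -- list with, per column, the list of values the column may take
    """
    # Remove instances with missing values.
    complete = [r for r in rows if None not in r]
    # One binary feature per (column index, value) pair.
    keys = [(j, v) for j, dom in enumerate(domains) for v in dom]
    encoded = [[int(r[j] == v) for (j, v) in keys] for r in complete]
    # Remove features whose value is always 0.
    keep = [k for k in range(len(keys)) if any(row[k] for row in encoded)]
    return [keys[k] for k in keep], [[row[k] for k in keep] for row in encoded]


# Made-up toy data: the only row containing "high" also has a missing value.
rows = [("low", "yes"), ("high", None), ("mid", "no"), ("low", "no")]
domains = [["low", "mid", "high"], ["yes", "no"]]
keys, X = preprocess(rows, domains)
print(keys)
print(X)
```

Here the feature for the value "high" disappears because it is never active once the incomplete row is removed.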


4.2 Evaluation of D-LearnPSDD


Table 2 compares D-LEARNPSDD, LEARNPSDD, Naive Bayes (NB), Tree Augmented
Naive Bayes (TANB) and logistic regression (LogReg)¹ in terms of accuracy
via five-fold cross-validation². For LEARNPSDD, we incrementally learned a
model on each fold until convergence on validation-data log-likelihood, following
the methodology in [20].

For D-LEARNPSDD, we incrementally learned a model on each fold until 
likelihood converged but then selected the incremental model with the highest 
training set accuracy. For NB and TANB, we learned a model per fold and 
compiled them to Arithmetic Circuits³, a more general form of PSDDs [6], which
allows us to compare the size of these Bayes net classifiers and the PSDDs. 
Finally, we compare all probabilistic models with a discriminative classifier, a 
multinomial logistic regression model with a ridge estimator. 

Table 2 shows that the proposed D-LEARNPSDD clearly benefits from the
introduced discriminative bias, outperforming LEARNPSDD in all but two
datasets, as the latter method is not guaranteed to learn significant relations
between feature and class variables. Moreover, it outperforms the Bayesian
classifiers in most benchmarks, as the learned PSDDs are more expressive and can
encode complex relationships among sets of variables or local dependencies
such as context-specific independence, while remaining tractable. Finally, note
that D-LEARNPSDD is competitive in terms of accuracy with
logistic regression (LogReg), a purely discriminative classification approach.


4.3 Impact of the Vtree on Discriminative Performance 


The structure and size of the learned PSDD are largely determined by the vtree it
is normalized for. Naturally, the vtree also has an important role in determining
the quality (in terms of log-likelihood) of the probability distribution encoded 
by the learned PSDD [20]. In this section, we study the impact that the choice 
of vtree and learning strategy has on the trade-offs between model tractability, 
quality and discriminative performance. 


¹ NB, TANB and LogReg are learned using Weka with default settings.
² In each fold, we hold out 10% of the data for validation.
³ Using the ACE tool, available at http://reasoning.cs.ucla.edu/ace/.


Table 2. Five-fold cross-validation accuracy and size in number of parameters


           D-LearnPSDD        LearnPSDD          NB                 TANB               LogReg
Dataset    Accuracy     Size  Accuracy     Size  Accuracy     Size  Accuracy     Size  Accuracy
Australian 86.2 ± 3.6    367  84.9 ± 2.7    386  85.1 ± 3.1    161  85.8 ± 3.4    312  84.1 ± 3.4
Breast     97.1 ± 0.9    291  94.9 ± 0.5    491  97.7 ± 1.2    114  97.7 ± 1.2    219  96.5 ± 1.6
Chess      97.3 ± 1.4   2178  94.9 ± 1.6   2186  87.7 ± 1.4    158  91.7 ± 2.2    309  96.9 ± 0.7
Cleve      82.2 ± 2.5    292  81.9 ± 3.2    184  84.9 ± 3.3    102  79.9 ± 2.2    196  81.5 ± 2.9
Corral     99.4 ± 1.4     39  98.1 ± 2.8     58  89.4 ± 5.2     26  98.8 ± 1.7     45  86.3 ± 6.7
Credit     85.6 ± 3.1    693  86.1 ± 3.6    611  86.8 ± 4.4    170  86.1 ± 3.9    326  84.7 ± 4.9
Diabetes   78.7 ± 2.9    124  77.2 ± 3.3    144  77.4 ± 2.56    46  75.8 ± 3.5     86  78.4 ± 2.6
German     72.3 ± 3.2   1185  69.9 ± 2.3    645  73.5 ± 2.7    218  74.5 ± 1.9    429  74.4 ± 2.3
Glass      79.1 ± 1.9    214  72.4 ± 6.2    321  70.0 ± 4.9    203  69.5 ± 5.2    318  73.0 ± 5.7
Heart      84.1 ± 4.3     51  78.5 ± 5.3     75  84.0 ± 3.8     38  83.0 ± 5.1     70  84.0 ± 4.7
Iris       90.0 ± 0.1     76  94.0 ± 3.7    158  94.7 ± 1.8     75  94.7 ± 1.8    131  94.7 ± 2.9
Mofn       98.9 ± 0.9    260  97.1 ± 2.4    260  85.0 ± 5.7     42  92.8 ± 2.6     78  100.0 ± 0
Pima       80.2 ± 0.3    108  74.7 ± 3.2    110  77.6 ± 3.0     46  76.3 ± 2.9     86  77.7 ± 2.9
Vehicle    95.0 ± 1.7   1186  93.9 ± 1.69  1560  86.3 ± 2.00   228  93.0 ± 0.8    442  94.5 ± 2.4
Waveform   85.0 ± 1.0   3441  78.7 ± 5.6   2585  80.7 ± 1.9    657  83.1 ± 1.1   1296  85.5 ± 0.7


Figure 3a shows test-set log-likelihood and Fig. 3b classification accuracy as a 
function of model size (in number of parameters) for the “Chess” dataset. We dis- 
play average log-likelihood and accuracy over logarithmically distributed ranges 
of model size. This figure contrasts the results of three learning approaches: D- 
LEARNPSDD when the vtree learning stage optimizes mutual information (MI, 
shown in light blue); when it optimizes conditional mutual information (CMI, 
shown in dark blue); and the traditional LEARNPSDD (in orange). 

Figure 3a shows that likelihood improves at a faster rate during the first 
iterations of LEARNPSDD, but eventually settles to the same values as D- 
LEARNPSDD because both optimize for log-likelihood. However, the discrimi- 
native bias guarantees that classification accuracy on the initial model will be 
at least as high as that of a Naive Bayes classifier (see Fig. 3b). Moreover, this 
results in consistently superior accuracy (for the CMI case) compared to the 
purely generative LEARNPSDD approach, as also shown in Table 2. The dip in
accuracy during the second and third intervals is a consequence of the generative
learning, which optimizes for log-likelihood and can therefore initially yield
feature-value correlations that decrease the model’s performance as a classifier.

Finally, Fig. 3b demonstrates that optimizing the vtree for conditional mutual
information results in an overall better model size vs. accuracy trade-off when
compared to optimizing for mutual information. Such a conditional mutual
information objective function is consistent with the conditional independence
constraint we impose on the structure of the PSDD and allows the model to consider
the special status of the class variable in the discriminative task.

Fig. 3. Log-likelihood and accuracy vs. model size trade-off of the incremental PSDD
learning approaches. MI and CMI denote mutual information and conditional mutual 
information vtree learning, respectively. (Color figure online) 


4.4 Robustness to Missing Features 


The generative models in this paper encode a joint probability distribution over 
all variables and therefore tend to be more robust against missing features than 
discriminative models, which only learn relations relevant to their discriminative 
task. In this experiment, we assessed this robustness aspect by simulating the 
random failure of 10% of the original feature set per benchmark and per fold 
in five-fold cross-validation. Figure 4 shows the average accuracy over 10 such 
feature failure trials in each of the 5 folds (flat markers) in relation to their full 
feature set accuracy reported in Table 2 (shaped markers). As expected, the per- 
formance of the discriminative classifier (LogReg) suffers the most during feature 
failure, while D-LEARNPSDD and LEARNPSDD are notably more robust than 
any other approach, with accuracy losses of no more than 8%. Note from the 
flat markers that the performance of D-LEARNPSDD under feature failure is 
the best in all datasets but one. 
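
The reason a generative classifier copes with failed features can be illustrated on the simplest case, a model in which the features are independent given the class: a missing feature is marginalized out, which simply drops its factor from the product. The sketch below is an illustration only. It is not the PSDD inference used in the experiments, and the parameters, the function names and the choice to fail at least one feature are assumptions.

```python
import math
import random


def choose_failed_features(n_features, fraction, rng):
    """Pick the indices of the features that fail (at least one, by assumption)."""
    n_fail = max(1, round(fraction * n_features))
    return set(rng.sample(range(n_features), n_fail))


def mask(row, failed):
    """Replace the values of failed features by None."""
    return [None if j in failed else v for j, v in enumerate(row)]


def log_score(row, c, prior, cond):
    """log Pr(C = c) plus the log-probabilities of the observed features only.

    Features equal to None are marginalized out; with class-conditionally
    independent features this amounts to skipping their factor.
    """
    total = math.log(prior[c])
    for j, v in enumerate(row):
        if v is None:
            continue
        p = cond[c][j]
        total += math.log(p if v == 1 else 1.0 - p)
    return total


def predict(row, prior, cond):
    return max(range(len(prior)), key=lambda c: log_score(row, c, prior, cond))


# Hypothetical parameters: cond[c][j] = Pr(F_j = 1 | C = c).
prior = [0.5, 0.5]
cond = [[0.2, 0.3, 0.5], [0.8, 0.7, 0.5]]
failed = choose_failed_features(n_features=3, fraction=0.1, rng=random.Random(0))
print(failed, predict(mask([1, 1, 0], failed), prior, cond))
```

A discriminative model such as logistic regression has no distribution over the features to marginalize, which is consistent with the larger accuracy loss reported for LogReg.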

Fig. 4. Classification robustness per method.


5 Related Work


A number of works have dealt with the conflict between generative and dis- 
criminative model learning, some dating back decades [14]. There are multiple 
techniques that support learning of parameters [13,23] and structure [21,24] 
of probabilistic circuits. Typically, different approaches are followed to either 
learn generative or discriminative tasks, but some methods exploit discriminative
models’ properties to deal with missing variables [22]. Other works that also
constrain the structure of PSDDs have been proposed before, such as Choi et
al. [3]. However, they only do parameter learning, not structure learning: their
approach to improve accuracy is to learn separate structured PSDDs for each
distribution of features given the class and feed them to a NB classifier. In [5],
Correia and de Campos propose a constrained SPN architecture that shows both
computational efficiency and classification performance improvements. However,
it focuses on decision robustness rather than robustness against missing values,
which is essential to the application range discussed in this paper. There are also
a number of methods that focus specifically on the interaction between
discriminative and generative learning. In [15], Khosravi et al. provide a method to
compute expected predictions of a discriminative model with respect to a
probability distribution defined by an arbitrary generative model in a tractable
manner. This combination makes it possible to handle missing values using
discriminative counterparts of generative classifiers [16]. More distant from this
work is the line of hybrid discriminative and generative models [19]; their focus
is on semi-supervised learning and on dealing with missing labels.


6 Conclusion 


This paper introduces a PSDD learning technique that improves classification 
performance by introducing a discriminative bias. Meanwhile, robustness against 
missing data is kept by exploiting generative learning. The method capitalizes 
on PSDDs’ domain knowledge encoding capabilities to enforce the conditional 
relation between the class and the features. We prove that this constraint is 
guaranteed to be enforced throughout the learning process and we show how not 
encoding such a relation might lead to poor classification performance. Evalu- 
ation on a suite of benchmarking datasets shows that the proposed technique 
outperforms purely generative PSDDs in terms of classification accuracy and the 
other baseline classifiers in terms of robustness. 


Abstract. Signature patterns have been introduced to model repetitive
behavior, e.g., of customers repeatedly buying the same set of products 
in consecutive time periods. A disadvantage of existing approaches to 
signature discovery, however, is that the required number of occurrences 
of a signature needs to be manually chosen. To address this limitation, we 
formalize the problem of selecting the best signature using the minimum 
description length (MDL) principle. To this end, we propose an encoding 
for signature models and for any data stream given such a signature 
model. As finding the MDL-optimal solution is unfeasible, we propose a 
novel algorithm that is an instance of widening, i.e., a diversified beam 
search that heuristically explores promising parts of the search space. 
Finally, we demonstrate the effectiveness of the problem formalization 
and the algorithm on a real-world retail dataset, and show that our 
approach yields relevant signatures. 


Keywords: Signature discovery - Minimum description length - 
Widening 


1 Introduction 


When analyzing (human) activity logs, it is especially important to discover 
recurrent behavior. Recurrent behavior can indicate, for example, personal pref- 
erences or habits, and can be useful in contexts such as personalized market- 
ing. Some types of behavior are elusive to traditional data mining methods: for 
example, behavior that has some temporal regularity but not strong enough to 
be periodic, and which does not form simple itemsets or sequences in the log. A 
prime example is the set of products that is essential to a retail customer: all of 
these products are bought regularly, but often not periodically due to different 
depletion rates, and they are typically bought over several transactions (in any
arbitrary order) rather than all at the same time.

To model and detect such behavior, we have proposed signature patterns [3]: 
patterns that identify irregular recurrences in an event sequence by segmenting 
the sequence (see Fig. 1). We have shown the relevance of signature patterns in 
the retail context, and demonstrated that they are general enough to be used in 
other domains, such as political speeches [2]. As a disadvantage, however, signa- 
ture patterns require the analyst to provide the number of recurrences, i.e., the 
number of segments in the segmentation. This number of segments influences the 
signature: fewer segments give a more detailed signature, while more segments 
result in a simpler signature. Although in some cases domain experts may have 
some intuition on how to choose the number of segments, it is often difficult to 
decide on a good trade-off between the number of segments and the complexity of 
the signature. The main problem that we study in this paper is therefore how to 
automatically set this parameter in a principled way, based on the data. 

Our first main contribution is a problem formalization that defines the best 
signature for a given dataset, so that the analyst no longer needs to choose the 
number of segments. By considering the signature corresponding to each possible 
number of segments as a model, we can naturally formulate the problem of select- 
ing the best signature as a model selection problem. We formalize this problem 
using the minimum description length (MDL) principle [4], which, informally, 
states that the best model is the one that compresses the data best. The MDL 
principle perfectly fits our purposes because (1) it allows selecting the simplest
model that adequately explains the data, and (2) it has been previously shown 
to be very effective for the selection of pattern-based models (e.g., [7,11]). 

After defining the problem using the MDL principle, the remaining question 
is how to solve it. As the search space of signatures is extremely large and the 
MDL-based problem formulation does not offer any properties that could be used 
to substantially prune the search space, we resort to heuristic search. Also here, 
the properties of signature patterns lead to technical challenges. In particular, 
we empirically show that a naive beam search often gets stuck in suboptimal 
solutions. Our second main contribution is therefore to propose a diverse beam 
search algorithm, i.e., an instance of widening [9], that ensures that a diverse set 
of candidate solutions is maintained on each level of the beam search. For this, 
we define a distance measure for signatures based on their segmentations. 


2 Preliminaries 


Fig. 1. A sequence of transactions and a 4-segmentation. We have the signature items
R = {a,b}, the remaining items E = {c,d,e}, the set of items I = {a,b,c,d,e}, and the
segmentation S = ⟨[T_1, T_2, T_3], [T_4, T_5], [T_6], [T_7]⟩.


Signatures. Let us first recall the definition of a signature as presented in [3].
Let I be the set of all items, and let α = ⟨T_1 ... T_n⟩, T_i ⊆ I, be a sequence of
itemsets. A k-segmentation of α, denoted S(α, k) = ⟨S_1 ... S_k⟩, is a sequence of k
non-overlapping consecutive sub-sequences of α, denoted S_i and called segments,
each consisting of consecutive transactions. An example of a 4-segmentation is
given in Fig. 1. Given S(α, k) = ⟨S_1 ... S_k⟩, a k-segmentation of α, we have
Rec(S(α, k)) = ∩_{S_i ∈ S(α,k)} (∪_{T_j ∈ S_i} T_j): the set of all recurrent items that are
present in each segment of S(α, k). For example, in Fig. 1, the segmentation
S(α, 4) = ⟨S_1, S_2, S_3, S_4⟩ gives Rec(S(α, 4)) = {a, b}. Given k and α, one
can compute S_max(α, k), the set of k-segmentations of α yielding the largest
sets of recurrent items: S_max(α, k) = argmax_{S(α,k)} |Rec(S(α, k))|. For example,
in Fig. 1, ⟨S_1, S_2, S_3, S_4⟩ is the only 4-segmentation yielding two recurrent
items. As all other 4-segmentations yield either zero or one recurrent item,
S_max(α, 4) = {⟨S_1, S_2, S_3, S_4⟩}. A k-signature (also named signature when k
is clear from context) is then defined as a maximal set of recurrent items in a
k-segmentation S, with S ∈ S_max(α, k). As S_max(α, k) can contain several
segmentations, we define the k-signature set Sig(α, k), which contains all k-signatures:
Sig(α, k) = {Rec(S_m(α, k)) | S_m ∈ S_max(α, k)}. The value k gives the number of
recurrences of the recurrent items in sequence α. Given a number of recurrences k,
finding a k-signature relies on finding a k-segmentation that maximizes the size
of the itemset that occurs in each segment of that segmentation. For example, in
Fig. 1, given segmentation S = ⟨S_1, S_2, S_3, S_4⟩ and given that S_max(α, 4) = {S},
we have Sig(α, 4) = {Rec(S)} = {{a, b}}. For simplicity, the segmentation
associated with a k-signature in Sig(α, k) is denoted S = ⟨S_1 ... S_k⟩, and the
signature items are denoted R ⊆ I. The remaining items are denoted E, i.e., E = I \ R.
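
These definitions can be made concrete with a brute-force sketch that enumerates every k-segmentation of a short sequence. It is exponential in the sequence length and meant purely as an illustration, not as the algorithm of [3]. Two assumptions are made: segments are taken to partition the whole sequence, and the seven transactions below are a hypothetical sequence chosen to agree with the segmentation and signature given in the caption of Fig. 1.

```python
from itertools import combinations


def segmentations(n, k):
    """Yield every way to cut transactions 0..n-1 into k contiguous, non-empty segments.

    Each segment is a half-open index range (start, end).
    """
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        yield [(bounds[i], bounds[i + 1]) for i in range(k)]


def rec(alpha, segmentation):
    """Recurrent items: intersection over segments of the union of their transactions."""
    per_segment = [set().union(*alpha[start:end]) for start, end in segmentation]
    return frozenset(set.intersection(*per_segment))


def signature_set(alpha, k):
    """Return (S_max, Sig): the best k-segmentations and the k-signatures they yield."""
    scored = [(seg, rec(alpha, seg)) for seg in segmentations(len(alpha), k)]
    best = max(len(items) for _, items in scored)
    s_max = [seg for seg, items in scored if len(items) == best]
    sig = {items for _, items in scored if len(items) == best}
    return s_max, sig


# Hypothetical sequence T_1..T_7 (0-indexed below), consistent with the Fig. 1 caption.
alpha = [{"a"}, {"c"}, {"a", "b", "d"}, {"b"}, {"a"}, {"a", "b"}, {"a", "b", "c", "e"}]
s_max, sig = signature_set(alpha, 4)
print(s_max)  # index ranges of the segments
print(sig)    # the 4-signature set
```

For this sequence the only segmentation in S_max is [T_1, T_2, T_3], [T_4, T_5], [T_6], [T_7], with signature {a, b}.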


Minimum Description Length (MDL). Let us now briefly introduce the basic 
notions of the minimum description length (MDL) principle [4] as it is commonly 
used in compression-based pattern mining [7]. Given a set of models ℳ and
a dataset D, the best model M ∈ ℳ is the one that minimizes L(D, M) =
L(M) + L(D|M), with L(M) the length, in bits, of the encoding of M, and
L(D|M) the length, in bits, of the encoding of the data given M. This is called
two-part MDL because it separately encodes the model and the data given the
model, which results in a natural trade-off between model complexity and data
complexity. To fairly compare all models, the encoding has to be lossless. To use
the MDL principle for model selection, the model class ℳ has to be defined (in
our case, the set of all signatures), as well as how to compute the length of the
model and the length of the data given the model. It should be noted that only 
the encoded length of the data is of interest, not the encoded data itself. 
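
Two-part model selection itself is a one-line minimization once the two length functions are available. The sketch below uses made-up code lengths for three hypothetical candidate models; the names select_mdl and lengths are ours.

```python
def select_mdl(models, model_length, data_length):
    """Return the model minimizing L(M) + L(D | M), both expressed in bits."""
    return min(models, key=lambda m: model_length(m) + data_length(m))


# Made-up lengths (L(M), L(D | M)) for three hypothetical candidates.
lengths = {
    "simple": (10.0, 120.0),   # cheap model, poorly compressed data
    "medium": (25.0, 80.0),
    "complex": (70.0, 60.0),   # expensive model, well compressed data
}
best = select_mdl(lengths, lambda m: lengths[m][0], lambda m: lengths[m][1])
print(best)
```

Only the lengths enter the comparison, which is why the encoded data never has to be materialized.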


3 Problem Definition 


To extract recurrent items from a sequence using signatures, one must define the 
number of segments k. Providing meaningful values for k usually requires expert 
knowledge and/or many tryouts, as there is no general rule to automatically set 
k. Our problem is therefore to devise a method that adjusts k, depending on the
data at hand. As this is a typical model selection problem, our approach relies
on the minimum description length (MDL) principle to find the best model from
a set of candidate models. However, the signature model must be refined into a
probabilistic model to use the MDL principle for model selection. In particular, the
occurrences of items in α should be defined according to a probability
distribution. With no information about these occurrences, the uniform distribution is
the most natural choice. Indeed, without information on the transaction in which
an item occurs, the best option is to assume it can occur uniformly at random in any
transaction of the sequence α. Moreover, the choice of the uniform distribution
has been shown to minimize the worst-case description length [4].

To make the signature model probabilistic, we assume that it generates three
different types of occurrences independently and uniformly. As the signature
gives the information that there is at least one occurrence of every signature
item in every segment, the first type of occurrences corresponds to this one
occurrence of signature items in every segment. These are generated uniformly over
all the transactions of every segment. The second type of occurrences are the
remaining signature item occurrences. Here, the information is that these items
already have occurrences generated by the previous type of occurrences. As α is
a sequence of itemsets, an item can occur at most once in a transaction. Hence,
for a given signature item, the second type of occurrences for this item are
distributed uniformly over the transactions where this item does not already occur
for the first type of occurrences. Finally, the third type are the occurrences of the
remaining items: the items that are not part of the signature. There is no
information about the occurrences of these items, hence we assume them to be generated
uniformly over all transactions of α.

With these three types of occurrences, the signature model is probabilistic: all 
occurrences in α are generated according to a probability distribution that takes
into account the information provided by the signature specification. Hence, we 
can now define the problem we are tackling: 


Problem 1. Let 𝒮 denote the set of signatures for all values of k,

$$\mathcal{S} = \bigcup_{k=1}^{|\alpha|} \mathrm{Sig}(\alpha, k).$$

Given a sequence α, it follows from the MDL principle that
the best signature S ∈ 𝒮 is the one that minimizes the two-part encoded length
of S and α, i.e.,

$$S_{\mathrm{MDL}} = \operatorname*{argmin}_{S \in \mathcal{S}} L(\alpha, S),$$

where L(α, S) is the two-part encoded length that we present in the next section.


4 An Encoding for Signatures 


As typically done in compression-based pattern mining [7], we use a two-part
MDL code that leads to decomposing the total encoded length L(α, S) into two
parts: L(S) and L(α|S), with the relation L(α, S) = L(S) + L(α|S). In the
upcoming subsection we define L(S), i.e., the encoded length of a signature,
after which Subsect. 4.2 introduces L(α|S), i.e., the length of the sequence α
given a signature S. In the remainder of this paper, all logarithms are in base 2.


4.1 Model Encoding: L(S) 


A signature is composed of two parts: (1) the signature items, and (2) the sig- 
nature segmentation. The two parts are detailed below. 


Signature Items Encoding. The encoding of the signature items consists of
three parts. The signature items are a subset of I, hence we first encode the
number of items in I. A common way to encode non-negative integer numbers
is to use the universal code for integers [4,8], denoted L_N¹. This yields a code
of size L_N(|I|). Next, we encode the number of items in the signature, using
again the universal code for integers, with length L_N(|R|). Finally, we encode
the items of the signature. As the order of signature items is irrelevant, we can
use an |R|-combination of |I| elements without replacement. This yields a length
of log C(|I|, |R|), with C(|I|, |R|) the binomial coefficient. From R and I, we can
deduce E.
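
The three parts of the item encoding can be computed as in the sketch below. It assumes the usual reading of the universal code for integers, in which log*(n) sums only the positive terms of log(n) + log(log(n)) + ..., and it assumes arguments of at least 1; the function names are ours.

```python
import math


def universal_integer_code_length(n):
    """L_N(n) = log*(n) + log2(2.865064) in bits, for an integer n >= 1.

    log*(n) adds log2(n), log2(log2(n)), ... for as long as the terms are positive.
    """
    length = math.log2(2.865064)
    term = math.log2(n)
    while term > 0:
        length += term
        term = math.log2(term)
    return length


def signature_items_length(n_items, n_signature_items):
    """L_N(|I|) + L_N(|R|) + log2 of the number of |R|-subsets of I."""
    return (universal_integer_code_length(n_items)
            + universal_integer_code_length(n_signature_items)
            + math.log2(math.comb(n_items, n_signature_items)))


# Sizes taken from the running example: |I| = 5 items, |R| = 2 signature items.
print(signature_items_length(5, 2))
```

The last term is what makes large signatures expensive: it grows with the number of possible item subsets of that size.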


Segmentation Encoding. We now present the encoding of the second part
of the signature: the signature segmentation. To encode the segmentation, we
encode the segment boundaries. These boundaries are indexed on the size of the
sequence, hence we first need to encode the number of transactions n. This can be
done using again the universal code for integers, which is of size L_N(n). Then, we
need to encode the number of segments |S|, which is of length L_N(|S|). To encode
the segments, we only have to encode the boundaries between two consecutive
segments. As there are |S| − 1 such boundaries, a naive encoded length would be
(|S| − 1) · log(n). An improved encoding takes into account the previous segments.
For example, when encoding the second boundary, we know that its value will
not be higher than n − |S_1|. Hence, we can encode it in log(n − |S_1|) instead of
log(n) bits. This principle can be applied to encode all boundaries. Another way
to further reduce the encoded length is to use the fact that we know that each
signature segment contains at least one transaction. We can therefore subtract
the number of remaining segments to encode the boundary of the segment we are
encoding. This yields an encoded length of

$$\sum_{i=1}^{|S|-1} \log\Bigl(n - (|S| - i) - \sum_{j=1}^{i-1} |S_j|\Bigr).$$


Putting Everything Together. The total encoded length of a signature S is 


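$$L(S) = L_\mathbb{N}(|\mathcal{I}|) + L_\mathbb{N}(|\mathcal{R}|) + \log\binom{|\mathcal{I}|}{|\mathcal{R}|} + L_\mathbb{N}(n) + L_\mathbb{N}(|S|) + \sum_{i=1}^{|S|-1} \log\Big(n - (|S| - i) - \sum_{j=1}^{i-1} |S_j|\Big).$$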


1 $L_\mathbb{N}(n) = \log^*(n) + \log(2.865064)$, with $\log^*(n) = \log(n) + \log(\log(n)) + \ldots$, where only the positive terms are included.


202 C. Gautrais et al. 


[Figure 2: the transactions T1 to T7 of the sequence α, grouped into segments, with three rows marking the first occurrence of each signature item in each segment, the other signature item occurrences, and the occurrences of the other items.]

Fig. 2. A sequence of transactions and its encoding scheme. We have
$\mathcal{R}$ = {a,b}, $\mathcal{E}$ = {c,d,e} and $\mathcal{I}$ = {a,b,c,d,e}.
The first occurrence of each signature item in each segment is encoded in the
red stream, the remaining signature item occurrences in the orange stream, and
the items from $\mathcal{E}$ in the blue stream. (Color figure online)


4.2 Data Encoding: L(a|S) 


We now present the encoding of the sequence given the model: L(α|S). This
encoding relies on the refinement of the signature model into a probabilistic
model presented in Sect. 3. To summarize, we have three separate encoding
streams that encode the three different types of occurrences presented in Sect. 3:
(1) one that encodes one occurrence of every signature item in every segment,
(2) one that encodes the rest of the signature item occurrences, and (3) one
that encodes the remaining item occurrences. An example illustrating the three
different encoding streams is presented in Fig. 2.


Encoding One Occurrence of Each Signature Item in Each Segment.
As stated in Sect. 3, the signature says that in each segment, there is at least
one occurrence of each signature item. The size of each segment is known (from
the encoding of the model, in Subsect. 4.1), hence we encode one occurrence of
each signature item in segment $S_i$ by encoding the index of the transaction,
within segment $S_i$, that contains this occurrence. From Sect. 3, this occurrence
is uniformly distributed over the transactions in $S_i$. As encoding an index over
$|S_i|$ equiprobable possibilities costs $\log(|S_i|)$ bits and as in each segment,
$|\mathcal{R}|$ occurrences are encoded this way, we encode each segment in
$|\mathcal{R}| * \log(|S_i|)$ bits.


Encoding the Remaining Signature Items' Occurrences. As presented
in Fig. 2, we now encode the remaining signature item occurrences to guarantee
a lossless encoding. Again, this encoding relies on encoding the transactions where
signature items occur. For each item a, we encode its occurrences
$occ(a) = \sum_{T_i \in \alpha} \sum_{p \in T_i} 1_{a=p}$ by encoding to which
transaction each belongs. As |S| occurrences have already been encoded using the
previous stream, there are occ(a) - |S| remaining occurrences to encode. These
occurrences can be in any of the n - |S| remaining transactions. From Sect. 3,
we use a uniform distribution to encode them. More precisely, the first
occurrence of item a can belong to any of the n - |S| transactions where a does
not already occur. For the second occurrence of a, there are now only
n - |S| - 1 transactions where a can occur. By applying this principle, we
encode all the remaining occurrences of a as
$\sum_{i=0}^{occ(a)-|S|-1} \log(n - |S| - i)$. For each item, we also use
$L_\mathbb{N}(occ(a) - |S|)$ bits to encode the number of occurrences. This
yields a total length of
$\sum_{a \in \mathcal{R}} \big( L_\mathbb{N}(occ(a) - |S|) + \sum_{i=0}^{occ(a)-|S|-1} \log(n - |S| - i) \big)$.


Remaining Items Occurrences Encoding. Finally, we encode the remaining
item occurrences, i.e., the occurrences of items in $\mathcal{E}$. The encoding
technique is identical to the one used to encode additional signature item
occurrences, with the exception that the remaining item occurrences can
initially be present in any of the n transactions. This yields a total code of
$\sum_{a \in \mathcal{E}} \big( L_\mathbb{N}(occ(a)) + \sum_{i=0}^{occ(a)-1} \log(n - i) \big)$.


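Putting Everything Together. The total encoded length of the data given the
model is given by:

$$L(\alpha|S) = \sum_{S_i \in S} |\mathcal{R}| * \log(|S_i|) + \sum_{a \in \mathcal{R}} \Big( L_\mathbb{N}(occ(a) - |S|) + \sum_{i=0}^{occ(a)-|S|-1} \log(n - |S| - i) \Big) + \sum_{a \in \mathcal{E}} \Big( L_\mathbb{N}(occ(a)) + \sum_{i=0}^{occ(a)-1} \log(n - i) \Big).$$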
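The three encoding streams translate into three sums over segment sizes and occurrence counts. The sketch below is one way to compute L(α|S) in Python. The names `data_length`, `occ_signature` and `occ_other` are hypothetical, and the handling of a zero count in `universal_int_length` (treated like 1, i.e., only the constant term) is an assumption of this sketch, since the text does not say how a count of 0 is encoded.

```python
from math import log2

def universal_int_length(n: int) -> float:
    """L_N(n) = log*(n) + log(2.865064); assumption: n = 0 costs the same as n = 1."""
    if n < 0:
        raise ValueError("negative counts cannot be encoded")
    length = log2(2.865064)
    term = log2(n) if n > 1 else 0.0
    while term > 0:
        length += term
        term = log2(term)
    return length

def data_length(seg_sizes, occ_signature, occ_other):
    """L(alpha|S) in bits.

    seg_sizes:     sizes |S_i| of the segments
    occ_signature: dict item -> occ(a) for the signature items (set R)
    occ_other:     dict item -> occ(a) for the remaining items (set E)
    """
    n = sum(seg_sizes)
    k = len(seg_sizes)
    r = len(occ_signature)
    # Stream 1: one occurrence of every signature item in every segment.
    stream1 = sum(r * log2(size) for size in seg_sizes)
    # Stream 2: the occ(a) - |S| remaining occurrences of each signature item.
    stream2 = sum(
        universal_int_length(occ - k) + sum(log2(n - k - i) for i in range(occ - k))
        for occ in occ_signature.values()
    )
    # Stream 3: all occurrences of the items outside the signature.
    stream3 = sum(
        universal_int_length(occ) + sum(log2(n - i) for i in range(occ))
        for occ in occ_other.values()
    )
    return stream1 + stream2 + stream3

# Hypothetical counts for a 7-transaction sequence split into 2 segments.
print(data_length([4, 3], {"a": 5, "b": 4}, {"c": 2, "d": 1, "e": 1}))
```

Adding this value to L(S) gives the total code length L(α, S) that the search in the next section minimizes.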


5 Algorithms 


The previous section presented how a sequence is encoded, completing our
problem formalization. The remaining problem is to find the signature minimizing
the code length, that is, finding $S_{MDL}$ such that
$S_{MDL} = \arg\min_{S \in \mathcal{S}} L(\alpha, S)$.


Naive Algorithm. A naive approach would be to directly mine the whole set
of signatures $\mathcal{S}$ and find the signature that minimizes the code
length. However, mining a signature with k segments has time complexity
$O(n^2 k)$. Mining the whole set of signatures requires k to vary from 1 to n,
resulting in a total complexity of $O(n^4)$. The quartic complexity does not
allow us to quickly mine the complete set of possible signatures on large
datasets, hence we have to rely on heuristic approaches.

To quickly search for the signature in S that minimizes the code length, we 
initially rely on a top-down greedy algorithm. We start with one segment con- 
taining the whole sequence, and then search for the segment boundary that min- 
imizes the encoded length. Then, we recursively search for a new single segment 
boundary that minimizes the encoded length. We stop when no segment can 
be added, i.e., when the number of segments is equal to the number of transac- 
tions. During this process, we record the signature with the best encoded length. 
However, this algorithm can perform early segment splits that seem promising 
initially, but that eventually impair the search for the best signature. 


5.1 Widening for Signatures 


To solve this issue, a solution is to keep the w signatures with the lowest code 
length at each step instead of keeping only the best one. This technique is called 
beam search and has been used to tackle optimization problems in pattern mining 
[6]. The beam width w is the number of solutions to keep at each step of the 
algorithm. However, the beam search technique suffers from having many of the 
best w signatures that tend to be similar and correspond to slight variations 
of one signature. Here, this means that most signatures in the beam would 
Algorithm 1. Widening algorithm for signature code length minimization.

 1: function SIGNATURE MINING(α = ⟨T1, ..., Tn⟩, β, w)
 2:   BestKSign = ∅, BestSign = ∅
 3:   for k = 1 → n do
 4:     AllKSign = Split1Segment(BestKSign)
 5:     S_opt = argmin_{S ∈ AllKSign} L(α, S)
 6:     BestSign = BestSign ∪ {S_opt}
 7:     BestKSign = {S_opt}
 8:     θ = threshold(β, w, AllKSign)
 9:     while S_opt ≠ ∅ and |BestKSign| < w do
10:       S_opt = argmin_{S ∈ AllKSign} L(α, S), ∄ S_i ∈ BestKSign, d(S_i, S) < θ
11:       BestKSign = BestKSign ∪ {S_opt}
12:   return argmin_{S ∈ BestSign} L(α, S)

Algorithm 2. Distance threshold computation.
1: function THRESHOLD(β, w, AllSign)
2:   KBest = β * |AllSign|
3:   BestS = GetBestSign(AllSign, KBest)
4:   return argmin_θ {N(θ), N(θ) = {S ∈ BestS, d(S, BestS[0]) < θ}, N(θ) > |BestS|/w}


have segmentations that are very similar. The widening technique [9] solves this
issue by adding a diversity constraint into the beam. Different constraints exist
[5,6,9], but a common solution is to add a distance constraint between each pair
of elements in the beam: all pairwise distances between the signatures in the
beam have to be larger than a given threshold θ. As this threshold is dependent
on the data and the beam width, we propose a method to automatically set its
value.

Algorithm 1 presents the proposed widening algorithm. Line 3 iterates over
the number of segments. Line 4 computes all signatures having k segments that
are considered to enter the beam. More specifically, function Split1Segment
computes the direct refinements of each signature in BestKSign. A direct
refinement of a signature corresponds to splitting one segment in the
segmentation associated with that signature. Line 5 selects the refinement
having the smallest code length. If several refinements yield the smallest code
length, one of these refinements is chosen at random. Lines 8 to 11 perform the
widening step by adding new signatures to the beam while respecting the pairwise
distance constraint. Line 8 computes the distance threshold (θ) depending on the
diversity parameter (β), the beam width (w), and the current refinements.
Algorithm 2 presents the details of the threshold computation. With this
threshold, we iteratively add a new element to the beam, until either the beam
is full or no new element can be added (line 9). Lines 10 and 11 add the
signature having the smallest code length and being at a distance of at least θ
from every current element of the beam. Line 12 returns the best overall
signature we have encountered.
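The control flow of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the authors' implementation: a segmentation is represented as a tuple of segment sizes, the code length, distance and threshold are passed in as callables (`code_length`, `distance`, `threshold` are hypothetical parameter names), and ties are broken deterministically rather than at random.

```python
def split_one_segment(sizes):
    """All direct refinements: split one segment into two non-empty parts."""
    refinements = set()
    for idx, size in enumerate(sizes):
        for left in range(1, size):
            refinements.add(sizes[:idx] + (left, size - left) + sizes[idx + 1:])
    return refinements

def widening_search(n, code_length, distance, threshold, w):
    """Return the segmentation of n transactions with the smallest code length found."""
    best_per_k = []   # best signature seen for each number of segments
    beam = []
    for k in range(1, n + 1):
        if k == 1:
            candidates = {(n,)}          # a single segment covering the sequence
        else:
            candidates = {r for s in beam for r in split_one_segment(s)}
        if not candidates:
            break
        # Sort by code length; the tuple itself breaks ties deterministically.
        ranked = sorted(candidates, key=lambda s: (code_length(s), s))
        best_per_k.append(ranked[0])
        beam = [ranked[0]]
        theta = threshold(ranked)
        # Widening step: add the next best candidates that are far enough
        # from every signature already in the beam.
        for cand in ranked[1:]:
            if len(beam) >= w:
                break
            if all(distance(cand, kept) >= theta for kept in beam):
                beam.append(cand)
    return min(best_per_k, key=code_length)

# Toy usage with a made-up cost whose unique minimum is the segmentation (3, 3).
toy_cost = lambda s: abs(len(s) - 2) + sum(abs(x - 3) for x in s[:2])
toy_dist = lambda a, b: sum(x != y for x, y in zip(a, b))
print(widening_search(6, toy_cost, toy_dist, lambda ranked: 1, w=2))
```

With w = 1 the loop reduces to the greedy top-down search, and with a threshold of 0 it reduces to a plain beam search.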
Distance Between Signatures. We now define the distance measure for
signatures (used in line 10 of Algorithm 1). As the purpose of the signature
distance is to ensure diversity in the beam, we will use the segmentation to
define the distance between two elements of the beam, i.e., between two
signatures. Terzi et al. [10] presented several distance measures for
segmentations. The disagreement distance is particularly appealing for our
purposes as it compares how transactions belonging to the same segment in one
segmentation are allocated to the other segmentation. Let
$S_a = \langle S_{a1} \ldots S_{ak} \rangle$ and
$S_b = \langle S_{b1} \ldots S_{bk} \rangle$ be two k-segmentations of a sequence
α. We denote by $d(S_a, S_b)$ the disagreement distance between segmentation a
and segmentation b. The disagreement distance corresponds to the number of
transaction pairs that belong to the same segment in one segmentation, but that
are not in the same segment in the other segmentation. Techniques on how to
efficiently compute this distance are presented in [10].
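The definition can be checked with a direct, quadratic-time computation. The sketch below counts disagreeing transaction pairs by brute force; it is meant only to make the definition concrete and does not use the efficient techniques of [10]. Segmentations are given as lists of segment sizes, which is a representation chosen for this example.

```python
from itertools import combinations

def segment_labels(seg_sizes):
    """Label each transaction with the index of the segment it falls in."""
    labels = []
    for idx, size in enumerate(seg_sizes):
        labels.extend([idx] * size)
    return labels

def disagreement_distance(sizes_a, sizes_b):
    """Number of transaction pairs grouped together in exactly one segmentation."""
    labels_a = segment_labels(sizes_a)
    labels_b = segment_labels(sizes_b)
    if len(labels_a) != len(labels_b):
        raise ValueError("both segmentations must cover the same sequence")
    return sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in combinations(range(len(labels_a)), 2)
    )

print(disagreement_distance([2, 2], [1, 3]))  # 3
```

For four transactions, moving the boundary from after T2 to after T1 changes the status of the pairs (T1,T2), (T2,T3) and (T2,T4), hence a distance of 3.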


Defining a Distance Threshold. Algorithm 1 uses a distance threshold θ
between two signatures, which controls the diversity constraint in the beam. If
θ is equal to 0, there is no diversity constraint, as any distance between two
different signatures is greater than 0. Higher values of θ enforce more diversity
in the beam: good signatures will not be included in the beam if they are too
close to signatures already in the beam. However, setting the θ threshold is not
easy. For example, θ depends on the beam width w. Indeed, with large beam
widths, θ should be low enough to allow many good signatures to enter the
beam.

To this end, we introduce a method that automatically sets the θ parameter,
depending on the beam width and on a new parameter β that is easier to
interpret. The β parameter ranges from 0 to 1 and controls the strength of the
diversity constraint. The intuition behind β is that its value will approximately
correspond to the relative rank of the worst signature in the beam. For example,
if β is set to 0.2, it means that signatures in the beam are in the top-20% in
ascending order of code length. Algorithm 2 details how θ is derived from β and
w; this algorithm is called by the threshold function in line 8 of Algorithm 1.

Knowing the set of all candidate signatures that are considered to enter
the beam, we retain only the proportion β of the best signatures (line 3 of
Algorithm 2). Then, in line 4 we extract the best signature. Finally, we look for
the distance threshold θ such that the number of signatures within a distance of
θ from the best signature is equal to the number of considered signatures divided
by the beam width w (line 5). The rationale behind this threshold is that since
we are adding w signatures to the beam and we want to use the proportion β of
the best signatures, the distance threshold should approximately discard 1/w of
the proportion β of the best signatures around each signature of the beam.
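The threshold computation can be sketched as follows. This is one possible reading of Algorithm 2 and should be taken as an assumption-laden illustration: candidate thresholds are restricted to the observed distances, the neighborhood uses a strict inequality as in the listing, and the fallbacks for an empty selection (θ = 0) and for the case where no threshold qualifies (largest observed distance) are choices made for this sketch only.

```python
def distance_threshold(ranked, beta, w, distance):
    """Derive theta from beta and w.

    ranked:   candidate signatures sorted by ascending code length
    beta:     diversity parameter in [0, 1]
    w:        beam width
    distance: callable returning the distance between two signatures
    """
    k_best = int(beta * len(ranked))
    best = ranked[:k_best]            # proportion beta of the best signatures
    if not best:
        return 0                      # assumption: beta = 0 means no constraint
    dists = sorted(distance(s, best[0]) for s in best)
    target = len(best) / w
    for theta in dists:               # smallest observed distance that qualifies
        if sum(d < theta for d in dists) > target:
            return theta
    return dists[-1]                  # assumption: fall back to the largest distance

# Toy usage: "signatures" are integers and the distance is the absolute difference.
print(distance_threshold(list(range(10)), 1.0, 5, lambda a, b: abs(a - b)))  # 3
```

In the toy call, 10 candidates and w = 5 give a target neighborhood of 2 signatures, and θ = 3 is the first observed distance whose strict neighborhood {0, 1, 2} exceeds that size.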
6 Experiments


This section analyzes runtimes and code lengths of variants of our algorithm on
a real retail dataset (see footnote 2). We show that our method runs
significantly faster than the naive baseline, and give advice on how to choose
the w and β parameters. Next, we illustrate the usefulness of the encoding to
analyze retail customers.
Fig. 3. Left: Mean relative code length for different instances of the widening
algorithm. For each customer, the relative code length is computed with regard
to the smallest code length found for this customer. Averaging these lengths
across all customers gives the mean relative code length. The β parameter sets
the diversity constraint and w the beam width. The solid black line shows the
mean code length of the naive algorithm. Bootstrapped 95% confidence intervals
[1] are displayed. Right: Mean runtime in seconds for different instances of the
widening algorithm. The dotted black lines show a bootstrapped 95% confidence
interval of the naive algorithm's mean runtime.


6.1 Algorithm Runtime and Code Length Analysis 


We here analyze the runtimes and code lengths obtained by variants of Algo- 
rithm 1. 3000 customers having more than 40 baskets in the Instacart 2017 
dataset are randomly selected. Customers having few purchases are less rel- 
evant, as we are looking for purchase regularities. These 3000 customers are 
analyzed individually, hence the algorithm is evaluated on different sequences. 


2 Code is available at https://bitbucket.org/clement_gautrais/mdl_signature_ida_2020/.

3 The Instacart Online Grocery Shopping Dataset 2017, accessed from
https://www.instacart.com/datasets/grocery-shopping-2017 on 05/04/2018.
Code Length Analysis. To assess the performance of the different algorithms,
we analyze the code length yielded by each algorithm on each of these 3000
customers. We evaluate different instances of the widening algorithm with
different beam widths w and diversity constraints β. The resulting relative mean
code lengths per algorithm instance are presented in Fig. 3 left. When increasing
the beam width, the code length always decreases for a fixed β value. This is
expected, as increasing the beam size allows the widening algorithm to explore
more solutions. As increasing the beam size improves the search, we recommend
setting it as high as the computational budget allows.

Increasing the β parameter usually leads to better code lengths. However, for
w = 5, higher β values give slightly worse results. Indeed, if β is too high, good
signatures might not be included in the beam if they are too close to existing
solutions. Therefore, we recommend setting β to a moderate value, for example
between 0.3 and 0.5. A strong point of our method is that it is not too
sensitive to different β values. Hence, setting this parameter to its optimal
value is not critical. The enforced diversity is highly relevant, as a fixed beam
size with some diversity finds code lengths that are similar to the ones found by
a larger beam size with no diversity. For example, with w = 5 and β = 0.3, the
code lengths are better than with w = 10 and β = 0. As using a beam size of
5 with β = 0.3 is faster than using a beam size of 10 with β = 0, it shows that
using diversity is highly suited to decrease runtime while yielding smaller code
lengths.


Runtime Analysis. We now present runtimes of different widening instances in
Fig. 3 right. The beam width mostly influences the runtime, whereas the β value
has a smaller influence. Overall, increasing β slightly increases computation
time, while yielding a noticeable improvement in the resulting code length,
especially for small beam sizes. Our method also runs 5 to 10 times faster than
the naive method. In this experiment, customers have a limited number of baskets
(at most 100), thus the $O(n^4)$ complexity of the naive approach exhibits
reasonable runtimes. However, in settings with more transactions (retail data
over a longer period for example), the naive approach will require hours to run,
and the performance gain of our widening approach will be a necessity. Another
important observation is that the naive method has a high variability in
runtimes. Confidence intervals are narrow for the widening algorithm (they are
barely noticeable on the plot), whereas they span over 5 s for the naive
algorithm.


6.2 Qualitative Analysis 


Figure 4 presents two signatures of a customer, to illustrate that signatures are 
of practical use to analyze retail customers, and that finding signatures with 
smaller code lengths is of interest. We use the widening algorithm to get a 
variety of good signatures according to our MDL encoding. The top signature in 
Fig. 4 is the best signature found: it has the smallest code length. This signature 
seems to correctly capture the regular behavior of this customer, as it contains 
7 products that are regularly bought throughout the whole purchase sequence. 
Fig. 4. Example of two signatures found by our algorithms. Gray vertical lines are
segment boundaries and each dot represents an item occurrence in a purchase
sequence. Top: best signature (code length of 5221.33 bits) found by the widening
algorithm, with w = 20 and β = 0.5. Bottom: signature found by the beam search
algorithm: w = 1 and β = 0, with a code length of 5338.46 bits (the worst code
length).


Knowing these 7 favorite products, a retailer could target its offers. The segments 
also give some information regarding the temporal behavior of this customer. For 
example, because segments tend to be smaller and more frequent towards the 
end of the sequence, one could guess that this customer is becoming a regular. 

On the other hand, the bottom signature is significantly worse than the top 
one. It is clear that it mostly contains products that are bought only at the 
end of the purchase sequence of this customer. This phenomenon occurs because 
the beam search algorithm, with w = 1, only picks the best solution at each 
step of the algorithm. Hence, it can quickly get stuck in a local minimum. This 
example shows that considering larger beams and adding diversity is an effective 
approach to optimize code length. Indeed, having a large and diverse beam is 
necessary to have the algorithm explore different segmentations, yielding better 
signatures. 


7 Conclusions 


We tackled the problem of automatically finding the best number of segments for 
signature patterns. To this end, we defined a model selection problem for signa- 
tures based on the minimum description length principle. Then, we introduced 
a novel algorithm that is an instance of widening. We evaluated the relevance 
and effectiveness of both the problem formalization and the algorithm on a 
retail dataset. We have shown that the widening-based algorithm outperforms 
the beam search approach as well as a naive baseline. Finally, we illustrated 
the practical usefulness of the signature on a retail use case. As part of future 
work, we would like to study our optimization techniques on larger databases
(thousands of transactions), like online news feeds. We would also like to work on 
model selection for sets of interesting signatures, to highlight diverse recurrences. 
Abstract. We introduce a novel efficient approach for community detection
based on a formal definition of the notion of community. We name
the links that run between communities weak links and links being inside 
communities strong links. We put forward a new objective function, 
called SIWO (Strong Inside, Weak Outside) which encourages adding 
strong links to the communities while avoiding weak links. This process 
allows us to effectively discover communities in social networks without 
the resolution and field of view limit problems some popular approaches 
suffer from. The time complexity of this new method is linear in the 
number of edges. We demonstrate the effectiveness of our approach on 
various real and artificial datasets with large and small communities. 


Keywords: Community detection - Social network analysis 


1 Introduction 


Community detection is an important task in social network analysis and can 
be used in different domains where entities and their relations are presented 
as graphs. It allows us to find linked nodes that we call communities inside 
graphs. There are community detection methods that partition the graph into 
subgroups of nodes such as the spectral bisection method [4] or the Kernighan- 
Lin algorithm [27]. There are also hierarchical methods such as the divisive 
algorithms based on edge betweenness of Girvan et al. [18] or agglomerative
algorithms based on dynamical processes such as Walktrap [20], Infomap [24] or
Label propagation [22]. We do not detail them and refer the interested reader to
[7,10,12], but we return to another class of hierarchical algorithms that aim
at maximizing Q-modularity introduced by Newman et al. [18]. After the greedy
agglomerative algorithm initially introduced by Newman [19], Blondel et al. [5]
proposed Louvain, one of the fastest algorithms to optimize Q-modularity and
to solve the community detection task. However, Fortunato et al. [11] showed
that Q-modularity suffers from the resolution limit, which means that by
optimizing Q-modularity, communities that are smaller than a certain scale
cannot be resolved. The field of view limit [25], in contrast to the resolution
limit, leads to overpartitioning communities with a large diameter.

To overcome the resolution limit of Q-modularity, several proposals have been
made, notably by [2,17,23], who introduced variants of this criterion allowing
the detection of community structures at different levels of granularity. However,
these revised criteria make the method time-consuming since they require
tuning a parameter. Therefore, we retain the greedy approach of Louvain for its
efficiency and ability to handle very large networks, but we introduce a new
criterion, SIWO, which relies on the notions of strong and weak links defined in
Sect. 2.

We consider that a community corresponds to a subgraph sparsely connected
to the rest of the graph. Contrary to the majority of methods, which do not
formally define what a community is and simply consider that it corresponds to a
subset of nodes densely connected internally, we define the conditions a subgraph
should meet to be considered as a community in Sect. 2. In Sect. 3, we present
the generic community detection algorithm. We can apply this general process
regardless of the objective function to improve other community detection
methods, as our experiments show.

Finally, the extensive experiments described in Sects. 4 and 5 confirm that
our objective function is less sensitive to the resolution and the field of view
limits than the objective functions mentioned earlier. Also, our algorithm has
consistently good performance regardless of the size of communities in a network
and is efficient on large networks having up to a million edges.


2 Notations and Definitions 


2.1 Strong and Weak Links 


A community is oftentimes defined as a subgraph in which nodes are densely
connected while sparsely connected to the rest of the graph. One way to find
such subgraphs is to divide the network into parts so that the number of links
lying inside each part is maximized. However, if there is no prior information
about the number of communities or their sizes, one can maximize the number
of links within communities by putting all the nodes in one community, but the
final result will not be the true communities. To avoid this trivial solution,
we penalize the missing links within the communities and we introduce the
notions of strong and weak links.
Fig. 1. A network with two communities; each consists of a clique of size 5.

Fig. 2. A network with 2 communities and 4 dangling nodes (1, 2, 3, and 4).
Weak links lie between communities, while strong links are inside them. We
develop our criterion so that it encourages adding strong links to the
communities while avoiding weak ones instead of penalizing the missing links.
These different types of links play different roles in graph connectivity:
removing a weak link may divide the graph into disconnected subgraphs, whereas
removing a random link would not. Let us focus on the link between nodes i and j
in Fig. 1 and also the link between nodes j and k in this graph. Node j is
connected to all the neighbors of node k, whereas nodes i and j have no common
neighbors. As nodes in the same community are generally more likely to have
common neighbors, (i, j) can be considered as a weak link whereas (j, k) can be
considered as a strong link, and this is exactly what we want to capture through
weights assigned to the links.


2.2 Edge Strength 


Given a graph G = (V, E) where V is the set of nodes and E the set of edges, we
propose to assign a weight in the range of (-1,1) to each edge, such that strong
links have larger weights. As nodes in the same community tend to have more
common neighbors compared to nodes in different communities, if
$S_{xy} > S_{xz}$ then $e_{xy}$ is more likely to be a strong link compared to
$e_{xz}$, with $S_{xy}$ defined by:


$$S_{xy} = |\{k \in V : (x,k) \in E, (y,k) \in E\}| \qquad (1)$$


We can compare two links according to S only if they share a node. Thus, if
we consider nodes x and y that have 5 and 20 links incident to them, then S
can be in the range of [0,4] and [0,19] for x and y respectively. Consequently,
for comparisons, we have to scale S values down to (-1,1). Let $S_x^{max}$ be
the maximum value of $S_{xy}$ for a particular node x
($S_x^{max} = \max_{y:(x,y) \in E} S_{xy}$). We divide the range [-1,1] into
$S_x^{max} + 1$ equal-length segments. Each value $S_{xy}$ in the range
$[0, S_x^{max}]$ is then mapped to the center of the $(S_{xy} + 1)$-th segment
using the equation:

$$w^x_{xy} = S_{xy} \frac{2}{S_x^{max} + 1} + \frac{1}{S_x^{max} + 1} - 1 \qquad (2)$$

where $w^x_{xy}$ is the scaled value of $S_{xy}$ from the viewpoint of node x
(min-max normalization could also work). We can also scale $S_{xy}$ from the
viewpoint of node y:
$w^y_{xy} = S_{xy} \frac{2}{S_y^{max} + 1} + \frac{1}{S_y^{max} + 1} - 1$, where
$S_y^{max} = \max_{z:(y,z) \in E} S_{yz}$. To decide whether we should trust x
or y, we need to look at the importance of each one in the network. The local
clustering coefficient (CC) [28], given below, is a measure that reflects the
importance of nodes and it can be computed even on large graphs, for instance
with MapReduce [15].


$$CC(x) = \frac{|\{e_{ij} : i, j \in N_x, e_{ij} \in E\}|}{\binom{d_x}{2}} \qquad (3)$$

where $d_x$ and $N_x$ are respectively the degree and the set of neighbors of
node x. CC is in the range of [0,1] with 1 for nodes whose neighbors form
cliques, and 0 for nodes whose neighbors are not connected to each other
directly. Here, we scale each edge from the viewpoint of the endpoint that is
more likely to be in a dense neighborhood characterized by a large CC:

$$w_{xy} = \begin{cases} w^x_{xy} & \text{if } CC(x) > CC(y) \\ w^y_{xy} & \text{otherwise} \end{cases} \qquad (4)$$
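Equations (1) to (4) can be combined into a short routine that assigns a weight to an edge. The following Python sketch assumes an undirected graph stored as a dictionary mapping each node to the set of its neighbors; the function names are hypothetical, and returning a clustering coefficient of 0 for nodes of degree below 2 is an assumption of this sketch, since Eq. (3) is undefined in that case.

```python
from itertools import combinations

def common_neighbors(adj, x, y):
    """S_xy of Eq. (1): number of nodes adjacent to both x and y."""
    return len(adj[x] & adj[y])

def scaled_strength(adj, x, y):
    """Eq. (2): S_xy scaled into (-1, 1) from the viewpoint of node x."""
    s_max = max(common_neighbors(adj, x, z) for z in adj[x])
    s = common_neighbors(adj, x, y)
    return s * 2 / (s_max + 1) + 1 / (s_max + 1) - 1

def clustering_coefficient(adj, x):
    """Eq. (3): fraction of pairs of neighbors of x that are linked."""
    d = len(adj[x])
    if d < 2:
        return 0.0  # assumption: no neighbor pairs, so CC is set to 0
    links = sum(1 for i, j in combinations(adj[x], 2) if j in adj[i])
    return links / (d * (d - 1) / 2)

def edge_weight(adj, x, y):
    """Eq. (4): trust the endpoint with the larger clustering coefficient."""
    if clustering_coefficient(adj, x) > clustering_coefficient(adj, y):
        return scaled_strength(adj, x, y)
    return scaled_strength(adj, y, x)

# Two triangles {0,1,2} and {3,4,5} joined by the bridge (2,3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(edge_weight(adj, 0, 1), edge_weight(adj, 2, 3))  # 0.5 -0.5
```

In this toy graph the edges inside each triangle receive a positive weight and the bridge receives a negative one, which matches the intended reading of strong and weak links.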


2.3 SIWO Measure 


The new measure that we propose encourages adding strong links into the com- 
munities while keeping the weak links outside of the communities (Strong Inside, 
Weak Outside). This measure is defined as follows: 


$$SIWO = \sum_{i,j \in V} \frac{w_{ij} \, \delta(c_i, c_j)}{2} \qquad (5)$$


where $c_i$ is the community of node i and $\delta(x, y)$ is 1 if x = y and 0 otherwise.
SIWO is the sum of weights of the edges that reside in the communities. This 
objective function provides a way to partition the set of nodes but it does not 
specify the conditions required by a subset of nodes to be a community. These 
conditions are defined in the following. 
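As a concrete illustration of the measure, the sketch below sums the weights of the intra-community edges. It assumes that each undirected edge is stored once, as a pair (i, j), so that the factor 1/2 of Eq. (5), which compensates for visiting every pair of nodes twice, is not needed. The names `siwo`, `weights` and `community` are hypothetical.

```python
def siwo(weights, community):
    """Sum of the weights of the edges whose endpoints share a community.

    weights:   dict mapping an undirected edge (i, j), stored once, to w_ij
    community: dict mapping each node to its community label
    """
    return sum(w for (i, j), w in weights.items() if community[i] == community[j])

# Two triangles joined by a weak bridge (2, 3); the weights are illustrative values.
weights = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.5, (2, 3): -0.5,
           (3, 4): 0.5, (3, 5): 0.5, (4, 5): 0.5}
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {v: "A" for v in range(6)}
print(siwo(weights, split), siwo(weights, merged))  # 3.0 2.5
```

Keeping the negatively weighted bridge outside the communities gives the higher score, so the measure does not reward placing all nodes in a single community.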


2.4 Community Definition 


Following [21] we consider that a subgraph C is a community in a weak sense if 
the following condition is satisfied: 


$$\sum_{v \in C} |N_v^C| > \sum_{v \in C} |N_v \setminus N_v^C| \qquad (6)$$


where $N_v$ is the set of the neighbors of node v and $N_v^C$ is the set of the
neighbors of node v that are also in community C. This condition means that,
collectively, the nodes in a community have more neighbors within the community
than outside. In this paper, we expand this definition by adding one more
condition. Given a partition $p = \{C_1, C_2, ..., C_l\}$ of a network, subgraph
$C_i$ is considered as a qualified community if it satisfies the following
conditions:


1. $C_i$ is a community in a weak sense (Eq. 6).
2. The number of links within $C_i$ exceeds the number of links towards any other
subgraph $C_j$ (j ≠ i) in the partition p taken separately, such that:


$$\frac{1}{2} \sum_{v \in C_i} |N_v^{C_i}| > \sum_{v \in C_i} |N_v^{C_j}|, \quad j \in [1, l], \; j \neq i \qquad (7)$$
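The two conditions can be tested directly on an adjacency structure. The sketch below follows the verbal statement of the conditions: the internal neighbor count of Eq. (6) counts each internal link twice, and for the second condition the number of links within the community is compared with the number of links towards each other community. The graph representation (a dictionary of neighbor sets) and the function name `is_qualified` are assumptions of this example.

```python
def is_qualified(adj, communities, idx):
    """Check whether communities[idx] is a qualified community.

    adj:         dict mapping each node to the set of its neighbors
    communities: list of disjoint sets of nodes forming a partition
    idx:         index of the community to check
    """
    members = communities[idx]
    # Condition 1 (weak sense): more neighbors inside than outside, summed over members.
    inside = sum(len(adj[v] & members) for v in members)
    outside = sum(len(adj[v] - members) for v in members)
    if not inside > outside:
        return False
    # Condition 2: more links within the community than towards any single other one.
    links_within = inside / 2   # each internal link was counted from both endpoints
    for j, other in enumerate(communities):
        if j == idx:
            continue
        links_towards = sum(len(adj[v] & other) for v in members)
        if not links_within > links_towards:
            return False
    return True

# Two triangles joined by one bridge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(is_qualified(adj, [{0, 1, 2}, {3, 4, 5}], 0))     # True
print(is_qualified(adj, [{0, 1}, {2}, {3, 4, 5}], 1))   # False
```

A singleton community can never be qualified under these conditions, because it has no internal neighbors.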
3 The SIWO Method


This method has four steps: pre-processing, optimizing SIWO, qualified commu- 
nity identification, and post-processing. They are discussed in detail below. 


Step 1. Pre-processing 

The first step calculates the edge strength weights ($w_{ij}$) needed during the
SIWO optimization. Moreover, to reduce the computational time, we temporarily
remove the dangling nodes. Node x is a dangling node if there exists a node y
such that by removing $e_{xy}$, the network would be divided into two
disconnected parts with $part_x$ (the part containing node x) being a tree.
Since $part_x$ has a tree structure, it cannot form a community on its own. So
all the nodes in $part_x$ belong to the same community as node y. In Fig. 2,
nodes 1, 2, 3 and 4 are dangling nodes and they belong to the same community as
node 5, unless we consider them outliers. Even though such tree-structured
subgraphs attached to the network are very sparse and cannot be considered as
communities, they satisfy Eqs. (6) and (7) defined for qualified communities. So
we do not need to consider them during the community detection process. To
remove them (and the links incident to them), we need to visit every node of the
network in the first pass to identify the nodes with a degree of 1. However,
after the first pass, we only need to check the neighbors of the nodes that were
removed in the previous pass.
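The removal of dangling nodes amounts to repeatedly pruning nodes of degree 1. A minimal Python sketch of that pruning is given below; it only covers the node removal, not the computation of the edge weights, and the function name and the dictionary-of-sets graph representation are assumptions of this example.

```python
def remove_dangling_nodes(adj):
    """Iteratively remove degree-1 nodes; return the pruned graph and the removed nodes."""
    adj = {v: set(neighbors) for v, neighbors in adj.items()}  # work on a copy
    removed = []
    # First pass: every node is inspected once to find the nodes of degree 1.
    frontier = [v for v in adj if len(adj[v]) == 1]
    while frontier:
        next_frontier = []
        for v in frontier:
            if v not in adj or len(adj[v]) != 1:
                continue  # already removed, or its degree changed in this pass
            (u,) = adj[v]
            adj[u].discard(v)
            del adj[v]
            removed.append(v)
            # Later passes: only neighbors of removed nodes can become dangling.
            if len(adj[u]) == 1:
                next_frontier.append(u)
        frontier = next_frontier
    return adj, removed

# A triangle {5,6,7} with a small tree (nodes 1 to 4) attached to node 5.
adj = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}
pruned, removed = remove_dangling_nodes(adj)
print(sorted(pruned), removed)  # [5, 6, 7] [1, 2, 3, 4]
```

The removed nodes can later be assigned to the community of the node where their tree is attached, here node 5.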


Step 2. Optimizing SIWO 

We use Louvain's optimization process to maximize SIWO since it has been
proven to be very efficient, but we replace the modularity by our criterion. This
greedy optimization process has two main phases, iteratively performed until a
local maximum of the objective function (SIWO measure) is reached. The first
phase starts by placing each node of graph G in its own community. Then each node
is moved to the neighboring community that results in the maximum gain of the
SIWO value. If no gain can be achieved, the node stays in its community. In the
second phase, a new weighted graph G' is created in which each node corresponds
to a community in G. Two nodes in G' are connected if there exists at least one
edge lying between their corresponding communities in G. Finally, we assign
each edge $e_{xy}$ in G' a weight equal to the sum of the weights of the edges
between the communities that correspond to x and y. These two phases are repeated
until no further improvement in the SIWO objective function can be achieved.
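The first phase can be illustrated with a short local-moving loop. The sketch below is a simplified illustration, not the authors' implementation: it covers phase 1 only (no graph aggregation), assumes a graph without self-loops stored as nested dictionaries of edge weights, and uses the fact that moving a node from community A to community B changes SIWO by the total weight of its links to B minus the total weight of its links to A.

```python
def local_moving(adj_w, community):
    """Phase 1: move nodes to the neighboring community with the largest SIWO gain.

    adj_w:     dict node -> dict neighbor -> edge weight (undirected, no self-loops)
    community: dict node -> initial community label
    """
    community = dict(community)
    moved = True
    while moved:
        moved = False
        for v, neighbors in adj_w.items():
            # Total weight of the links from v to each neighboring community.
            link = {}
            for u, w in neighbors.items():
                link[community[u]] = link.get(community[u], 0.0) + w
            stay = link.get(community[v], 0.0)
            best_c, best_gain = community[v], 0.0
            for c, total in link.items():
                if total - stay > best_gain:
                    best_c, best_gain = c, total - stay
            if best_c != community[v]:   # only strictly positive gains trigger a move
                community[v] = best_c
                moved = True
    return community

# Two triangles joined by a weak bridge; every node starts in its own community.
adj_w = {0: {1: 0.5, 2: 0.5}, 1: {0: 0.5, 2: 0.5}, 2: {0: 0.5, 1: 0.5, 3: -0.5},
         3: {2: -0.5, 4: 0.5, 5: 0.5}, 4: {3: 0.5, 5: 0.5}, 5: {3: 0.5, 4: 0.5}}
print(local_moving(adj_w, {v: v for v in adj_w}))
```

Since every move strictly increases SIWO and the number of partitions is finite, the loop terminates at a local maximum.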


Step 3. Qualified Community Identification 

This step determines qualified communities complying with Eqs. (6) and (7) from
the dense subgraphs discovered in the previous step. However, there may exist
communities consisting of a single node that is weakly connected to all of its
neighbors (sire” = 0) and that has links with non-positive weight incident to it;
we call them Lone communities. Since the decision about the communities of such
nodes cannot be made on edge strength, we let the majority of their neighbors
decide their communities. To reduce the computational time, as for
dangling nodes, we temporarily remove these nodes in this step and bring them
back in the final step. Then, we identify the unqualified communities, which do
not satisfy Eq. (6) or Eq. (7). We keep merging each unqualified community with
one of its neighboring communities (qualified or not) until no unqualified
community remains. For that, we first assign a weight equal to 1 to each edge.
Then, we repeat the two phases of Louvain. In phase 1, we create a new graph
G* in which each node corresponds to a community identified in step 3 for the
first iteration, or in phase 2 for the subsequent ones, and where each edge e_xy
is assigned a weight equal to the sum of the weights of the edges between the
communities that correspond to x and y. We also add to each node a self-loop whose
weight equals the sum of the weights of the edges that reside in its corresponding
community. In phase 2, we visit all nodes in G*. If a node x has a self-loop whose
weight is larger than (1) half of the sum of the weights of the edges incident
to it and (2) the weight of any edge connecting x to another node in G*, then
the community assigned to x satisfies both conditions in Eqs. (6) and (7), and
we let x stay in its community. Otherwise, we move node x to the neighboring
community that results in the maximum decrease in the sum of the weights of
the edges that lie between communities of G*.
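
The phase 2 test on a node of G* can be sketched as below. This is an
illustration under stated assumptions: the function and argument names are
hypothetical, and "edges incident to it" is taken to mean the edges linking x
to other nodes of G* (excluding the self-loop), which is one possible reading
of the text.

```python
def is_qualified(x, self_loop, inter):
    """Check the two conditions on node x of G*.

    self_loop: dict node -> weight of its self-loop (intra-community weight).
    inter: dict {(a, b): weight} for edges between distinct nodes of G*.
    """
    external = [w for (a, b), w in inter.items() if x in (a, b)]
    s = self_loop.get(x, 0)
    # (1) self-loop heavier than half of the weight leaving x, and
    # (2) self-loop heavier than every single edge towards another node.
    return s > 0.5 * sum(external) and all(s > w for w in external)
```

A node that fails the check is the one that would be merged into a neighboring
community in the step described above.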


Step 4. Post-processing 

Finally, each lone community that was temporarily removed is sequentially added
back to the network and merged with the community in which it has the most
neighbors. If two or more communities tie and they have more than one
connection to the node, then one is chosen at random. Otherwise, we choose the
community of the most important neighbor, i.e., the one with the largest degree
centrality within its community. Since we add lone nodes one after the other, the
community to which an earlier node was assigned might no longer be the best for
that node. To resolve this issue, once all lone nodes are added to the network, we
repeatedly move each of them to the community of the majority of its
neighbors until no further movement can be made. Dangling nodes are also added to
the network in the reverse order in which they were removed, and they are assigned
to the community of their unique neighbor.
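
A minimal sketch of the majority-vote re-attachment of lone nodes follows. It
is not the authors' code: names are hypothetical, ties are broken
deterministically instead of at random or by degree centrality, and a cap on
the number of refinement rounds is added as a safeguard.

```python
from collections import Counter


def reattach_lone_nodes(lone_nodes, adj, community, max_rounds=100):
    """Assign each lone node to the community of the majority of its neighbours."""
    community = dict(community)  # do not modify the caller's mapping

    def majority(u):
        votes = Counter(community[v] for v in adj[u] if v in community)
        if not votes:
            return None
        top = max(votes.values())
        # Deterministic tie-break, standing in for the rule described in the text.
        return min((c for c, k in votes.items() if k == top), key=str)

    # Sequential insertion: each node only sees the nodes placed before it.
    for u in lone_nodes:
        choice = majority(u)
        community[u] = choice if choice is not None else ("lone", u)

    # Refinement: revisit the lone nodes until none of them moves.
    for _ in range(max_rounds):
        moved = False
        for u in lone_nodes:
            choice = majority(u)
            if choice is not None and choice != community[u]:
                community[u] = choice
                moved = True
        if not moved:
            break
    return community
```

The refinement loop is what corrects early assignments that were made before
the other lone nodes had been placed.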


4 The Resolution Limit of SIWO 


Fortunato and Barthélemy [11] used two sample networks, shown in Fig. 3, to
demonstrate how Q-modularity is affected by the resolution limit. The first
example is a ring of cliques where each clique is connected to its adjacent cliques
through a single link. If the number of cliques is larger than about √m, with m
being the total number of edges in the network, then optimizing Q-modularity
results in merging the adjacent cliques into groups of two or more, even though
each clique corresponds to a community. The second example is a network
containing 4 cliques: 2 of size k and 2 of size p. If k >> p, Q-modularity similarly
fails to find the correct communities and the cliques of size p will be merged.
To prove in general that SIWO resolves the resolution limit of Q-modularity, the
exact structure of the network would have to be known, which is not possible. So,
we analyze whether SIWO is affected by the resolution limit on these two networks.
Given the definition of SIWO, let us consider the edge e_xy between two adjacent
cliques in the first network. Since x and y do not have any common neighbors, the
edge between them has a non-positive weight. Therefore, by maximizing the SIWO
measure in our algorithm, the adjacent cliques will not be merged. For the edge
e_xy between the cliques of size p in the second network, since x and y have at
most one common neighbor, the edge between them has a non-positive weight.
Therefore, the cliques in the second network will not be merged either.

Fig. 3. Schematic examples: (a) a ring of cliques, where adjacent cliques are
connected through a single link; (b) a network with 2 cliques of size k and 2
cliques of size p.
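
The structural fact used in this argument, that the endpoints of a link
between adjacent cliques share no neighbor, can be checked on a small example.
The sketch below only counts common neighbors; it does not compute the SIWO
edge weight, and the way the ring is wired (node 0 of one clique linked to
node 1 of the next) is an assumption made for the illustration.

```python
from itertools import combinations


def ring_of_cliques(num_cliques, clique_size):
    """Return (adjacency dict, list of inter-clique links); needs num_cliques >= 2."""
    adj = {}
    for c in range(num_cliques):
        members = [(c, i) for i in range(clique_size)]
        for u in members:
            adj.setdefault(u, set())
        for u, v in combinations(members, 2):
            adj[u].add(v)
            adj[v].add(u)
    bridges = []
    for c in range(num_cliques):
        u, v = (c, 0), ((c + 1) % num_cliques, 1)  # single link to the next clique
        adj[u].add(v)
        adj[v].add(u)
        bridges.append((u, v))
    return adj, bridges


def common_neighbours(adj, u, v):
    return len(adj[u] & adj[v])
```

In a ring of six 5-cliques, every inter-clique link has zero common neighbors,
whereas two nodes inside a clique share the remaining clique members.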


5 Experimental Results 


We compared the performance of our method with the most widely used and
efficient algorithms, as pointed out in several recent state-of-the-art studies
[8,29], on both real and synthetic networks. The algorithms are: (1) Fastgreedy
[6]; (2) Infomap; (3) Infomap+, which is Infomap to which we added the third step
of our algorithm (to relieve its sensitivity to the field of view limit and to
demonstrate that our framework can be used to improve other algorithms); (4) Label
Propagation [22]; (5) Louvain¹ [5]; (6) Walktrap² [20]. It should be noted that
Infomap is the only one of these algorithms that suffers from the field of view
limit.

The results are evaluated according to the Adjusted Rand Index (ARI) [14] 
and Normalized Mutual Information (NMI) [26]. As both ARI and NMI show 
similar results, we only present ARI results for lack of space. We also compared 
the results of different methods according to the ratio of the number of detected 
communities over the true number of communities in the ground-truth to observe 
how a method is affected by the resolution and the field of view limits. 
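
For reference, both evaluation quantities can be computed from two label
lists. The sketch below uses the standard pair-counting formula for ARI; the
function names are chosen for this example and are not taken from the paper's
code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions given as label lists."""
    n = len(labels_true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)


def community_count_ratio(labels_true, labels_pred):
    """Number of detected communities over the true number of communities."""
    return len(set(labels_pred)) / len(set(labels_true))
```

A ratio above 1 indicates that true communities were split, and a ratio below
1 indicates that they were merged.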


5.1 Real Networks 


We used 5 real networks and the ground-truth communities are available for 4 
of them. Table 1 presents the properties of these networks. 

We compared SIWO and Louvain on the Eurosis network [9], which represents
scientific web pages from 12 European countries and the hyperlinks between
them, and for which no ground-truth communities are known. However, since each
European country has its own language, web pages in different countries are
sparsely connected to each other. Moreover, as reported in [9], some of the
countries can be divided into smaller components, e.g. the Montenegro network
includes three components: (1) Telecom and Engineering, (2) Faculties and
(3) High Schools. Louvain detects 13 communities whereas SIWO detects 16
communities in this network. Louvain assigns all nodes in the Montenegro network
to one giant community, whereas SIWO puts Faculties and High Schools in one
community and Telecom and Engineering web pages in another. These two
communities are connected to each other by only 7 links, yet Louvain cannot
separate them due to its resolution limit.

1 https://github.com/taynaud/python-louvain.
2 https://www-complexnetworks.lip6.fr/~latapy/PP/walktrap.html.


Table 1. Properties of real networks

Network       | #nodes | #edges | #C
Karate [30]   | 34     | 78     | 2
Polbooks (a)  | 105    | 441    | 3
Football [13] | 115    | 613    | 12
Eurosis [9]   | 1218   | 5999   | -
Polblogs [1]  | 1222   | 16717  | 2

(a) http://www.orgnet.com


Table 2. Comparison of 7 algorithms according to ARI and the ratio of the number
of detected communities over the true number of communities in the ground-truth on
real networks. The table shows the average results and standard deviations computed
over 10 runs of the algorithms on each network.


Karate Polbooks | Football | Polblogs 


SIWO ARI 10 0.67+0 0.7940 |0.77+0 
C/C, 1+0 L3 10 1.50 
Fastgreedy | ARI (0.68 +0 0.63 +0 |0.47+0 |0.78+0 
C/C, 1540 1.30 0.5+0 |540 
Infomap ARI 0.70 0.64+0 0.84+0 |0.68+0 
C/C, 1.5+0 16+0 0.9+ 17.5+0 
Infomap+ |ARI |0.70+0 |0.66+0 |0.84+40/0.76+0 
C/C,y 1540 |13+40 09+ 1.50 
Label_prop| ARI | 0.66+0.3/0.66+0 |0.73+0 |0.8+0 
C/C, 1.2 +0.35|1.1+0.1 0.8 +0.1|/2.1+0 


Louvain |ARI 0.4640 |0.55+0 0.8+0 /0.7740 
C/C, 2+0 13+0 0.8+0 |450 
Walktrap | ARI (0.3240 /0.65+0 0.81+0 |0.76+0 
C/C, 3+0 1340 |08+0 |55+0 


Table 2 presents the comparison with respect to ARI and C/C_r, the ratio
of the number of detected communities over the true number of communities
in the ground-truth (both ARI and C/C_r should be as close to 1 as possible),
on real networks with ground-truth communities. It shows that SIWO performs
best on Karate and Polbooks based on ARI. It also outperforms the other
methods on the Karate, Football, and Polblogs networks according to the C/C_r
measure (SIWO could detect the exact communities with respect to the ground-truth
on these networks). Infomap detects a considerably larger number of communities
in the Polblogs network, which indicates that this algorithm is sensitive to the
field of view limit [25]. However, Infomap+ is much less sensitive to this limit,
which implies that the third step of SIWO, added to Infomap in Infomap+, is
effective in resolving the field of view limit. Considering the results for all
networks, SIWO is the top performer among these algorithms on a variety of networks.


5.2 Synthetic Networks 


To analyze the effect of the resolution and field of view limits, it is important
to test how community detection algorithms perform on networks with small and
large communities. Therefore, in this work we generated two sets of networks using
LFR [16] to test the different algorithms: one with large communities and one
with small communities. The first set favors algorithms that suffer from the
resolution limit, such as Louvain, and the second set favors algorithms with the
field of view limit, such as Infomap. Each set includes networks with a varying
number of nodes and mixing parameter. The mixing parameter controls the
fraction of edges that lie between communities. We do not generate networks with
a mixing parameter of 0.5 or more since, from this point on, the communities
in the ground truth no longer satisfy the definition of community. The input
parameters used to generate these two sets are presented in Table 3. Figures 4
and 5 present, respectively, ARI and the ratio of the number of detected
communities over the true number of communities (C/C_r). Panels correspond to
networks with a specific number of nodes (1000 to 100000) and are divided
into two parts: the lower (respectively upper) part shows the average
(respectively standard deviation) of ARI (or C/C_r) computed over 20 graphs
(10 with small and 10 with large communities) as a function of the mixing parameter.
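
As a small illustration of what the mixing parameter measures, the fraction of
inter-community edges of a labeled graph can be computed directly. The helper
below is a hypothetical name for this example and is not part of the LFR
generator.

```python
def inter_community_fraction(edges, community):
    """Fraction of edges whose endpoints lie in different communities."""
    if not edges:
        raise ValueError("the graph has no edges")
    inter = sum(1 for u, v in edges if community[u] != community[v])
    return inter / len(edges)
```

On a 4-cycle split into two pairs of adjacent nodes, half of the edges cross
the community boundary.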


Table 3. Input parameters of the LFR benchmark: Set 1 contains networks with large
communities and Set 2 contains networks with small communities. For each
combination of parameters we generated 10 networks.


Parameter                   | Set 1                  | Set 2
#nodes (N)                  | [1, 10, 50, 100] × 10³ | [1, 10, 50, 100] × 10³
Average and max degrees     | 20 and N/10            | 20 and √N
Mixing parameter            | [1, ..., 7] × 0.1      | [1, ..., 7] × 0.1
Min and max community sizes | N/20 and N/10          | Default and by default √N


Figure 4 shows that the performance of Fastgreedy decreases as the mixing
parameter increases. Louvain and Walktrap perform well on the smallest networks in
the set; however, their performance drops when we apply them to the networks with
50000 nodes or more. Label propagation, Infomap and Infomap+ perform well
until the mixing parameter reaches 0.3. However, a larger mixing parameter
causes a rapid decrease in the ARI value when applying these algorithms to
the two largest networks in the set. These three algorithms have a large standard
deviation and their outputs are not stable on these networks. SIWO correctly
detects the communities when the mixing parameter is less than or equal to 0.3
(ARI ≈ 1) regardless of the size of the network and has the best performance overall.
Figure 5 clearly shows the resolution limit of Louvain and Fastgreedy, as they
underestimate the number of communities. SIWO is the best performer in terms
of the number of communities and has a very small standard deviation, whereas
Infomap+ and Label propagation have a large standard deviation and fail to find
the correct number of communities when the mixing parameter exceeds 0.3.


[Figures 4 and 5: panels for networks with 1000, 10000, 50000 and 100000 nodes;
x-axis: mixing parameter; lower part: ARI (or C/C_r); upper part: Std; curves:
SIWO, Label_prop, Infomap+, Louvain, Fastgreedy, Walktrap, Infomap]

Fig. 4. Evaluation according to ARI on synthetic networks generated with LFR.

Fig. 5. Evaluation of SIWO, Label propagation, Infomap+, Louvain and Fastgreedy
according to C/C_r on synthetic networks generated with LFR.


6 Scalability 


We analyze how the computational cost of SIWO varies with the size of the
network. The pre-processing step has two phases: removing dangling nodes, which
requires time of the order of n, where n is the number of nodes, and calculating
the edge strength weights, which requires time of the order of nd² = 2md, where
m is the number of edges and d is the average degree. In many real networks d is
much smaller than n and does not grow with n [10]. The second and third steps
follow the same greedy process as Louvain. Louvain is theoretically cubic
but was demonstrated experimentally to be quasi-linear [3] and has been applied
successfully to large networks with several million nodes and 100
million links. The time complexity of the post-processing step depends on the
number of Lone communities; if all the nodes are in Lone communities, it
requires O(nd²) time. Overall, the time complexity of SIWO is O(n + md),
which is similar to that of Louvain because d is small and n = 2m/d.
SIWO can detect communities in a network with 100000 nodes and 1 million
edges in about 1 min on a commodity laptop with an i7 processor and 8 GB of RAM.
The current implementation of SIWO is in Python, derived from python-louvain.
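
The identity between the two forms of the pre-processing cost follows from the
handshake lemma. A short derivation in the notation of this section (n nodes, m
edges, average degree d):

```latex
\begin{align*}
\sum_{v} \deg(v) &= 2m && \text{(each edge is counted at both of its endpoints)} \\
n\,d &= 2m && \text{(definition of the average degree } d\text{)} \\
n\,d^{2} &= (n\,d)\,d = 2md, \qquad n = \frac{2m}{d}.
\end{align*}
```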


7 Conclusion 


This paper introduces SIWO, a novel objective function based on edge strength 
for community detection, and a formal definition of community, that we use to 
lead the community detection process after optimizing the objective function. 
This framework can also be applied to other community detection methods to
remedy the weaknesses that cause the resolution limit or the field of view limit.
Our extensive experiments using both small and large networks confirm that our
algorithm is consistent, effective and scalable for networks with either large or
small communities, demonstrating less sensitivity to the resolution limit and the
field of view limit that most community mining algorithms suffer from. As a future
direction, we will generalize the proposed algorithm to weighted/directed networks.
Notably, the SIWO algorithm can be easily generalized to handle weighted graphs: it
only requires adjusting the pre-processing step by combining the weights from the
input graph with the weights computed by SIWO to evaluate the edge strength.


Abstract. Multi-label classification in deep learning is a practical yet
challenging task, because class overlaps in the feature space mean that
each instance is associated with multiple class labels. This requires the
prediction of more than one class category for each input instance. To the
best of our knowledge, this is the first deep learning study which quantifies
uncertainty and model interpretability in multi-label classification,
and which applies it to the problem of recognising proteins expressed
in cell types in testes based on immunohistochemically stained images.
Multi-label classification is achieved by thresholding the class
probabilities, with the optimal thresholds adaptively determined by a grid
search scheme based on Matthews correlation coefficients. We adopt
MC-Dropweights to approximate Bayesian inference in multi-label classification
to evaluate the usefulness of estimating uncertainty with predictive
score to avoid overconfident, incorrect predictions in decision making.
Our experimental results show that MC-Dropweights visibly improves
uncertainty estimation compared to state-of-the-art approaches.


Keywords: Uncertainty estimation · Multi-label classification · Cell
type prediction · Human Protein Atlas · Proteomics


1 Introduction 


Proteins are the essential building blocks of life, and resolving the spatial distri- 
bution of all human proteins at an organ, tissue, cellular, and subcellular level 
greatly improves our understanding of human biology in health and disease. The 
testis is one of the most complex organs in the human body [15]. The
spermatogenesis process results in the testes containing more tissue-specific genes
than any other organ in the human body. Based on an integrated ‘omics’ approach
using transcriptomics and antibody-based proteomics, more than 500 proteins
with distinct testicular protein expression patterns have previously been
identified [10], and transcriptomics data suggest that over 2,000 genes are elevated
in testes compared to other organs. The function of a large proportion of these
proteins is, however, largely unknown, and the full set of genes involved in the
complex process of spermatogenesis is yet to be characterized. Manual annotation
provides the standard for scoring immunohistochemical staining patterns in
different cell types. However, it is tedious, time-consuming and expensive, as well
as subject to human error, as it is sometimes challenging to separate cell types by
eye. It would be extremely valuable to develop an automated algorithm that can
recognise the various cell types in testes based on antibody-based proteomics
images while providing information on which proteins are expressed by each cell
type [10]. This is, therefore, a multi-label image classification problem.

Fig. 1. Schematic overview: cell type-specific expression of testis elevated genes [10]


Exact Bayesian inference with deep neural networks is computationally
intractable. Many methods have been proposed for quantifying uncertainty or
confidence estimates. Recently, Gal [5] proved that a neural network trained with
dropout, a well-known regularisation technique [13], is equivalent to a specific
variational approximation in Bayesian neural networks. Uncertainty estimates can
be obtained by training a network with dropout and then taking Monte Carlo
(MC) samples of the prediction using dropout during test time. Following Gal
[5], Ghoshal et al. [7] showed similar results for neural networks with
Dropweights, and Teye [14] for networks with batch normalisation layers in
training (Fig. 1).
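
The test-time sampling idea can be sketched for a single sigmoid layer. This
toy example is an assumption-laden illustration, not the network used in the
paper: the function name, the plain-list weight layout and the inverted
scaling of kept weights are choices made for the sketch. Each forward pass
randomly drops individual weights, and the spread of the resulting
probabilities serves as an uncertainty estimate.

```python
import math
import random


def mc_dropweights_predict(x, W, b, drop_prob=0.3, passes=100, seed=0):
    """Monte Carlo forward passes through one dense sigmoid layer.

    x: list of d input features; W: d x M weight matrix (list of rows);
    b: list of M biases. Returns (mean, std) of the M output probabilities.
    """
    rng = random.Random(seed)
    keep = 1.0 - drop_prob
    samples = []
    for _ in range(passes):
        out = []
        for j in range(len(b)):
            z = b[j]
            for i in range(len(x)):
                if rng.random() < keep:         # keep this individual weight
                    z += x[i] * W[i][j] / keep  # rescale so the expectation is unchanged
            out.append(1.0 / (1.0 + math.exp(-z)))
        samples.append(out)
    mean = [sum(s[j] for s in samples) / passes for j in range(len(b))]
    std = [math.sqrt(sum((s[j] - mean[j]) ** 2 for s in samples) / passes)
           for j in range(len(b))]
    return mean, std
```

With `drop_prob=0` every pass is identical and the standard deviation
collapses to zero, which is a quick sanity check for the sampler.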

In this paper, we aim to: 


1. Present the first approach in multi-label pattern recognition that can
recognise various cell type-specific protein expression patterns in testes based
on antibody-based proteomics images and provide information on which cell
types express the protein with estimated uncertainty.

2. Show that Multi-Label Classification (MLC) is achieved by thresholding the class
probabilities, with the Optimal Thresholds adaptively determined by a grid
search scheme based on the Matthews correlation coefficient.

3. Demonstrate through extensive experimental results that a Deep Learning
Model with MC-Dropweights [7] is significantly better than a wide spectrum
of MLC algorithms such as Binary Relevance (BR), Classifier Chain (CC),
Probabilistic Classifier Chain (PCC), Condensed Filter Tree (CFT),
Cost-sensitive Label Embedding with Multidimensional Scaling (CLEMS) and
state-of-the-art MC-Dropout [5] algorithms across various cell types.

4. Develop Saliency Maps to increase model interpretability by visualizing
descriptive regions and highlighting pixels from different areas in the input
image. Deep learning models are often accused of being “black boxes”, so
they need to be precise and interpretable, and the uncertainty in their
predictions must be well understood.


Our objective is not to achieve state-of-the-art performance on these prob- 
lems, but rather to evaluate the usefulness of estimating uncertainty leveraging 
MC-Dropweights with predictive score in multi-label classification to avoid over- 
confident, incorrect predictions for decision making. 


2 Multi-label Cell-Type Recognition and Localization
with Estimated Uncertainty


2.1 Problem Definition 


Given a set of training data D, where $X = \{x_1, x_2, \ldots, x_N\}$ is the set of
N images and the corresponding labels $Y = \{y_1, y_2, \ldots, y_N\}$ give the
cell-type information, the vector $y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,M}\}$ is a
binary vector, where $y_{i,j} = 1$ indicates that the $i$-th image belongs to the
$j$-th cell type. Note that an image may belong to multiple cell types, i.e.,
$1 \le \sum_j y_{i,j} \le M$. Based on D(X, Y), we constructed
a Bayesian Deep Learning model that outputs the predictive probability,
with estimated uncertainty, of a given image $x_i$ belonging to each cell category.
That is, the constructed model acts as a function $f : X \to Y$, using
neural network weights w, that is as close as possible to the original function
that generated the outputs Y, and whose estimated values
$(\hat{y}_{i,1}, \hat{y}_{i,2}, \ldots, \hat{y}_{i,M})$, with $0 \le \hat{y}_{i,j} \le 1$,
are as close as possible to the actual values $(y_{i,1}, y_{i,2}, \ldots, y_{i,M})$.


2.2 Solution Approach 


We tailored Deep Convolutional Neural Network (DCNN) architectures for cell
type detection and localisation by considering a large image capacity, binary
cross-entropy loss, sigmoid activation, Dropweights in the fully connected
layer, and a Batch Normalization formulation of propagating uncertainty in
deep learning, in order to estimate meaningful model uncertainty.


Multi-label Setup: There are multiple approaches to transform multi-label
classification into multiple single-label problems with an associated loss
function [8]. In this study, we used immunohistochemically stained testes tissue
consisting of 8 cell types corresponding to 512 testis elevated genes.
Therefore, we define an 8-dimensional class label vector
$Y = \{y_1, y_2, \ldots, y_8\}$, $y_c \in \{0, 1\}$, given 8 cell types. $y_c$
indicates whether the corresponding cell type expresses the protein in the image,
while an all-zero vector [0; 0; 0; 0; 0; 0; 0; 0] represents “Absence” (no cell
type in any of the 8 categories expresses the protein).


Multi-label Classification Cost Function: The cost function for multi-label
classification has to account for the fact that the class predictions are
not mutually exclusive. We therefore selected a sigmoid output activation
combined with the binary cross-entropy loss.
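
A minimal sketch of this loss for one image is given below, assuming the loss
is averaged over the labels; the function names are chosen for the example.
Each label gets its own independent sigmoid probability, so several labels can
be active at once.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, logits, eps=1e-12):
    """Mean binary cross-entropy over the labels of one multi-label example."""
    total = 0.0
    for y, z in zip(y_true, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)
```

Unlike a softmax, the per-label probabilities are not forced to sum to one,
which matches the non-exclusive nature of the cell-type labels.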


Data Augmentation: We used Keras’ image pre-processing package to apply 
affine transformations to the images, such as rotation, scaling, shearing, and 
translation during training and inference. This reduces the epistemic uncertainty 
during training, captures heteroscedastic aleatoric uncertainty during inference 
and overall improves the performance of models. 


Multi-label Classification Algorithm: In Bayesian classification, the mean 
of the predictive posterior corresponds to the parameter point estimates, and the 
width of the posterior reflects the confidence of the predictions. The output of the 
network is an M-dimensional probability vector, where each dimension indicates 
how likely each cell type in a given image expresses the protein. The number 
of cell types that simultaneously express the protein in an image varies. One 
method to solve this multi-label classification problem is to place a threshold on
each dimension. However, different dimensions may be associated with different
thresholds. If the value of the $i$-th dimension of $\hat{y}$ is greater than its
threshold, we can say that the $i$-th cell type is expressed in the given tissue.
The main problem is defining the threshold for each class label.

A threshold based on the Matthews Correlation Coefficient (MCC) is applied to
the model output to determine the predicted classes, which improves the accuracy
of the models.

We adopted a grid search scheme based on the MCC to estimate the optimal
threshold for each cell type-specific protein expression [2]. Details of the
optimal threshold finding algorithm are shown in Algorithm 1.

The idea is to estimate the threshold for each cell category separately. For each
candidate threshold, we convert the predicted probability vector into a binary
vector and calculate the MCC between the thresholded predictions and
the actual values. The MCCs for all thresholds are stored in the vector ω, from
which we find the index of the threshold that yields the largest correlation. The
Optimal Threshold for the $i$-th dimension is then the corresponding threshold
value. We then leveraged the
Bias-Corrected Uncertainty quantification method [6] using Deep Convolutional
Neural Network (DCNN) architectures with Dropweights [7].

Input: Ground Truth Vector: {y_{i,1}, y_{i,2}, ..., y_{i,M}};
       Estimated Probability Vector: {ŷ_{i,1}, ŷ_{i,2}, ..., ŷ_{i,M}};
       Upper Bound for threshold = Ω, and Threshold Stride = S
Result: The Optimal Thresholds T = (ot_1, ot_2, ..., ot_M)
Initialization: The set of thresholds T = (ot_1 = 0, ot_2 = 0, ..., ot_M = 0);
for i ← 1 to M do
    j ← 0;
    ω ← ∅;
    τ ← ∅;
    m ← 0;
    while j < Ω do
        Initialize M-dimensional binary vector v ← (v_1 = 0, v_2 = 0, ..., v_M = 0);
        if ŷ_i > j then
            v_i ← 1;
        else
            v_i ← 0;
        end
        ω ← ω.append(MCC(y[1 : i], v));
        τ ← τ.append(j);
        j ← j + S
    end
    m ← argmax_m ω = (ω_1, ω_2, ..., ω_m, ...);
    ot_i ← τ[m]
end


Algorithm 1. Find Optimal Threshold
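
One way to realise Algorithm 1 in code is sketched below. It is an
illustration under assumptions: the function names, the row-per-image list
layout, the default upper bound and stride, and the convention that an
undefined MCC counts as 0 are all choices made for this example. For each
cell type it scans thresholds 0, S, 2S, ... below the upper bound and keeps
the first one with the largest MCC.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for two binary lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def optimal_thresholds(Y_true, Y_prob, upper=1.0, stride=0.05):
    """Grid search of one threshold per label.

    Y_true, Y_prob: lists of N rows (images) with M columns (cell types).
    """
    n_labels = len(Y_true[0])
    steps = int(round(upper / stride))
    thresholds = []
    for i in range(n_labels):
        truth = [row[i] for row in Y_true]
        best_t, best_score = 0.0, -2.0  # MCC is always >= -1
        for k in range(steps):
            t = k * stride  # avoids accumulating floating-point error
            pred = [1 if row[i] > t else 0 for row in Y_prob]
            score = mcc(truth, pred)
            if score > best_score:
                best_t, best_score = t, score
        thresholds.append(best_t)
    return thresholds
```

The thresholds would normally be fitted on held-out validation predictions and
then applied unchanged at test time.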


Network Architecture: Our models are trained and evaluated using Keras
with a TensorFlow backend. For the DNN architecture, we used a generic
building block with the following model structure: Conv-Relu-BatchNorm-
MaxPool-Conv-Relu-BatchNorm-MaxPool-Dense-Relu-Dropweights and Dense-
Relu-Dropweights-Dense-Sigmoid, with 32 convolution kernels, 3 × 3 kernel size,
2 × 2 pooling, dense layers with 512, 128 and 8 units, and a feed-forward
Dropweights probability of 0.3. We optimised the model using the Adam optimizer
with the default learning rate of 0.001. The training process was conducted for
1000 epochs, with a mini-batch size of 32. We repeated our experiments three times
for each algorithm and calculated the mean of the results.


3 Estimating Bias-Corrected Uncertainty Using Jackknife 
Resampling Method 


3.1 Bayesian Deep Learning and Estimating Uncertainty 


There are many measures to estimate uncertainty, such as softmax variance,
expected entropy, mutual information, predictive entropy and averaging
predictions over multiple models. In supervised learning, information gain, i.e.
the mutual information between the input data and the model parameters, is
considered the most relevant measure of epistemic uncertainty [4,12]. Estimation of
entropy from a finite set of data suffers from a severe downward bias when
the data is under-sampled. Even small biases can result in significant
inaccuracies when estimating entropy [9]. We leveraged the Jackknife resampling
method to calculate bias-corrected entropy [11].

Given a set of training data D, where $X = \{x_1, x_2, \ldots, x_N\}$ is the set of
N images and the corresponding labels $Y = \{y_1, y_2, \ldots, y_N\}$, a BNN is
defined in terms of a prior p(w) on the weights, as well as the likelihood p(D|w).
Consider class probabilities $p(y_{x_i} = c \mid x_i, w_t, D)$ with
$w_t \sim q(w \mid D)$ and $W = (w_t)_{t=1}^{T}$, a set of independent and
identically distributed (i.i.d.) samples drawn from $q(w \mid D)$.
The procedure below computes the Monte Carlo (MC) estimate of the posterior
predictive distribution, its Entropy and Mutual Information (MI):

$$\hat{I}(y_i, w \mid x_i, D) = H(\hat{p}(y_i \mid x_i, D)) - \frac{1}{|W|} \sum_{w \in W} H(p(y_i \mid x_i, w, D)), \quad (1)$$

where

$$\hat{p}(y_i \mid x_i, D) = \frac{1}{|W|} \sum_{w \in W} p(y_i \mid x_i, w, D). \quad (2)$$

The stochastic predictive entropy is
$H[y \mid x, w] = H(\hat{p}) = -\sum_{c} \hat{p}_c \log(\hat{p}_c)$,
where $\hat{p}_c = \frac{1}{T} \sum_{t} p_{tc}$ is the entire sample maximum
likelihood estimator of probabilities.
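
Equations (1) and (2) translate directly into code. In the sketch below, each
element of `samples` is the class distribution produced by one stochastic
forward pass; the function names are chosen for this example.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mc_mutual_information(samples):
    """MC estimate of Eq. (1) from T sampled class distributions.

    Returns (mutual information, entropy of the mean, mean of the entropies).
    """
    T, C = len(samples), len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(C)]  # Eq. (2)
    total = entropy(mean)                           # plug-in entropy of the mean
    expected = sum(entropy(s) for s in samples) / T  # average per-pass entropy
    return total - expected, total, expected
```

When all passes agree, the mutual information is zero; when confident passes
disagree with each other, it approaches the entropy of the averaged
prediction.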

The first term in the MC estimate of the mutual information is called the
plug-in estimator of the entropy. It has long been known that the plug-in
estimator is biased and underestimates the true entropy [11,17].

A classic method for correcting the bias is the Jackknife resampling method [3].
In order to solve the bias problem, we propose a Jackknife estimator to estimate the
epistemic uncertainty to improve an entropy-based estimation model. Unlike
MC-Dropout, it does not assume constant variance. If D(X, Y) is the observed random
sample, the $i$-th Jackknife sample, $x_{(i)}$, is the subset of the sample that
leaves out observation $x_i$: $x_{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N)$.
For sample size N, the Jackknife standard error $\hat{\sigma}$ is defined as

$$\hat{\sigma} = \sqrt{\frac{N-1}{N} \sum_{i=1}^{N} \left(\hat{\sigma}_{(i)} - \hat{\sigma}_{(\cdot)}\right)^2},$$

where $\hat{\sigma}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\sigma}_{(i)}$ is the
empirical average of the Jackknife replicates. Here, the Jackknife estimator
is an unbiased estimator of the variance of the sample mean. The Jackknife
correction of a plug-in estimator H(·) is computed according to the method below [3]:
Given a sample $(p_t)_{t=1}^{T}$, where each $p_t$ is a discrete distribution on
1...C classes and T corresponds to the total number of MC-Dropweights forward
passes during the test:


1. for each t = 1...T
   - calculate the leave-one-out estimator: $\hat{p}_c^{-t} = \frac{1}{T-1} \sum_{j \neq t} p_{jc}$
   - calculate the plug-in entropy estimate: $\hat{H}_{(-t)} = H(\hat{p}^{-t})$
2. calculate the bias-corrected entropy
   $\hat{H}_J = T\hat{H} - \frac{T-1}{T} \sum_{t=1}^{T} \hat{H}_{(-t)}$, where
   $\hat{H}_{(-t)}$ is the observed entropy based on a sub-sample in which the
   $t$-th individual is removed.

We leveraged the following relation:

$$\mu_{-i} = \frac{1}{T-1} \sum_{j \neq i} x_j = \mu + \frac{\mu - x_i}{T-1},$$

which removes the $i$-th data point from the sample mean
$\mu = \frac{1}{T} \sum_{i} x_i$ and recomputes the mean $\mu_{-i}$. This makes
it possible to quickly calculate leave-one-out estimators of a discrete
probability distribution.
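
The procedure can be sketched as follows, using the standard jackknife
correction (T times the plug-in estimate minus (T-1)/T times the sum of the
leave-one-out estimates) and the leave-one-out mean relation to avoid
re-averaging the samples for every t. Function names are chosen for this
example.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def plug_in_entropy(samples):
    T, C = len(samples), len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(C)]
    return entropy(mean)


def jackknife_entropy(samples):
    """Bias-corrected entropy from T >= 2 sampled class distributions."""
    T, C = len(samples), len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(C)]
    loo_sum = 0.0
    for s in samples:
        # Leave-one-out mean via mu_{-t} = mu + (mu - x_t) / (T - 1).
        loo = [mean[c] + (mean[c] - s[c]) / (T - 1) for c in range(C)]
        loo_sum += entropy(loo)
    return T * entropy(mean) - (T - 1) / T * loo_sum
```

Because entropy is concave, the corrected value is never below the plug-in
estimate, which is consistent with the plug-in estimator underestimating the
true entropy.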

The epistemic uncertainty can be obtained as the difference between the
approximate predictive posterior entropy (or total entropy) and the average
uncertainty in predictions (i.e. the aleatoric entropy):

$$I(y : \omega) = H_e(y \mid x) = H_t(y \mid x) - H_a(y \mid x) = \hat{H}_J(y \mid x) - \mathbb{E}_{q(\omega)}\left[\hat{H}_J(y \mid x, \omega)\right]$$

Therefore, the mutual information $I(y : \omega)$, as a measure of bias-corrected
epistemic uncertainty, represents the variability in the predictions made by the
neural network weight configurations drawn from approximate posteriors. It
derives an estimate of the finite sample bias from the leave-one-out estimators
of the entropy and reduces the bias considerably, down to $O(n^{-2})$ [3].

The bias-corrected uncertainty estimation model identifies regions of the data
space that are ambiguous or difficult to classify, such as data with noise in the
inputs, or inputs from a domain different from the one the model was trained on.
Such inputs should be assigned a higher aleatoric uncertainty and, as a result, we
can expect high model uncertainty in these regions.

Following Gal [5], we define the stochastic versions of Bayesian uncertainty
using MC-Dropweights, where the class probabilities
$p(y_{x_i} = c \mid x_i, w_t, D)$, with $w_t \sim q(w \mid D)$ and
$W = (w_t)_{t=1}^{T}$ a set of independent and identically distributed (i.i.d.)
samples drawn from $q(w \mid D)$, can be approximated by the
average over the MC-Dropweights forward passes.

We trained the multi-label classification network with all eight classes. We
dichotomised the network outputs for each cell type using the optimal thresholds
found with Algorithm 1, with 1000 MC-Dropweights forward passes at test time. In
these detection tasks, $p(y_{x_i} \ge \mathrm{OptimalThreshold}_i \mid x_i, w_t, D)$,
where 1 marks the presence of a cell type, is sufficient to indicate the most
likely decision along with the estimated uncertainty.


3.2 Dataset 


Our main dataset is taken from the Human Protein Atlas project, which maps the
distribution of all human proteins in human tissues and organs [15]. Here, we used
high-resolution digital images of immunohistochemically stained testes tissue
consisting of 8 cell types: spermatogonia, preleptotene spermatocytes, pachytene
spermatocytes, round/early spermatids, elongated/late spermatids, Sertoli cells,
Leydig cells, and peritubular cells, publicly available in the Human Protein Atlas
version 18 (v18.proteinatlas.org), as shown in Fig. 2:
Fig. 2. Examples of proteins expressed only in one cell-type [10]


Fig. 3. Annotated heatmap of a correlation matrix between cell types (panel
title: Co-occurrence matrix). Both axes list the eight cell types, with counts
in parentheses: Leydig (129), Elongated/late spermatids (712), Pachytene (539),
Peritubular (41), Preleptotene (368), Round/early spermatids (705),
Sertoli (106), Spermatogonia (395). The colour scale runs from 0.0 to 1.0.


A relationship was observed between the spermatogonia and preleptotene
spermatocyte cell types, and between the round/early spermatid and
elongated/late spermatid cell types together with pachytene spermatocytes.
Figure 3 illustrates the correlation coefficients between cell types. The
observable pattern is that very few cell types are strongly correlated with
each other.


3.3 Results and Discussions 


We conducted the experiments on Human Protein Atlas datasets to validate the 
proposed algorithm, MC-Dropweights in Multi-Label Classification. 
Multi-label Classification Model Performance: Model evaluation metrics
for multi-label classification are different from those used in multi-class (or
binary) classification. The performance metrics of multi-label classifiers can be
classified as label-based (i.e. it is assumed that labels are mutually exclusive)
and example-based [16]. In this work, example-based measures (accuracy score,
Hamming loss, F1 score) and rank loss are used to evaluate the performance
of the classifiers.
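
For orientation, the example-based measures can be sketched as below. The paper does not spell out its exact definitions, so the formulas here are assumptions: Hamming loss as the fraction of wrongly predicted label slots, accuracy as the per-example Jaccard index, and F1 as the per-example harmonic mean of precision and recall, each averaged over examples.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots that are predicted incorrectly."""
    n, n_labels = len(y_true), len(y_true[0])
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (n * n_labels)


def example_accuracy(y_true, y_pred):
    """Mean Jaccard index |Y and Z| / |Y or Z| over examples (assumed definition)."""
    scores = []
    for yt, yp in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(yt, yp) if t and p)
        union = sum(1 for t, p in zip(yt, yp) if t or p)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def example_f1(y_true, y_pred):
    """Mean per-example F1 = 2|Y and Z| / (|Y| + |Z|) (assumed definition)."""
    scores = []
    for yt, yp in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(yt, yp) if t and p)
        denom = sum(yt) + sum(yp)
        scores.append(2 * inter / denom if denom else 1.0)
    return sum(scores) / len(scores)
```

Label vectors are binary lists with one entry per cell type; an example with no true and no predicted labels is scored as a perfect match, which is a convention chosen for this sketch.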


Table 1. Performance metrics


Metrics             | BR     | CC     | PCC    | CFT    | CLEMS  | MC-Dropout | MC-Dropweights
Hamming loss        | 0.2445 | 0.2420 | 0.2420 | 0.2375 | 0.2370 | 0.207      | 0.1925
Rank loss           | 3.6700 | 3.5740 | 3.1580 | 3.2920 | 3.1120 | 2.862      | 2.626
F1 score            | 0.5038 | 0.5184 | 0.5733 | 0.5373 | 0.5902 | 0.6306     | 0.6627
Avg. accuracy score | 0.4236 | 0.4389 | 0.4643 | 0.4573 | 0.5052 | 0.6150     | 0.7067


In the first experiment, we compared the MC-Dropweights neural network-based
method with five machine learning MLC algorithms introduced in Sect. 1, namely
Binary Relevance (BR), Classifier Chain (CC), Probabilistic Classifier Chain
(PCC), Condensed Filter Tree (CFT), and Cost-Sensitive Label Embedding
with Multi-dimensional Scaling (CLEMS), and with the MC-Dropout neural network
model. Table 1 shows that MC-Dropweights outperforms all of these algorithms on
every metric, with the lowest Hamming loss (0.1925) and rank loss (2.626) and
the highest F1 score (0.6627) and average accuracy (0.7067), which demonstrates
the importance of considering Dropweights in the neural network.


Cell Type-Specific Predictive Uncertainty: The relationship between 
uncertainty and predictive accuracy grouped by correct and incorrect predic- 
tions is shown in Fig. 4. It is interesting to note that, on average, the high- 
est uncertainty is associated with Elongated/late Spermatids and Round/early 
Spermatids. This indicates that there is some feature which contributes greater 
uncertainty to the Spermatids class types than to the other cell types. 


Cell Type Localization: Estimated uncertainty with Saliency Mapping is a 
simple technique to uncover discriminative image regions that strongly influ- 
ence the network prediction in identifying a specific class label in the image. It 
highlights the most influential features in the image space that affect the pre- 
dictions of the model [1] and visualises the contributions of individual pixels to 
epistemic and aleatoric uncertainties separately. We calculated the class activa- 
tion maps (CAM) [18] using the activations of the fully connected layer and the 
weights from the prediction layer as shown in Fig. 5. 
Fig. 4. Distribution of uncertainty values for all protein images, grouped by correct
and incorrect predictions. Label assignment was based on optimal thresholding (Algo- 
rithm 1). For an incorrect prediction, there is a strong likelihood that the predictive 
uncertainty is also high in all cases except for Spermatids. 
Fig. 5. Saliency maps for some common methods towards model explanation


4 Conclusion and Discussion 


In this study, a multi-label classification method was developed using a deep
learning architecture with Dropweights for the purpose of predicting
cell-type-specific protein expression with estimated uncertainty, which can
increase the ability to interpret predictions with confidence and make models
based on deep learning more applicable in practice. The results show that a deep
learning model with MC-Dropweights yields the best performance among all the
classifiers compared.

Building a truly large-scale, fully automated, high-precision, very
high-dimensional image analysis system that can recognise various
cell-type-specific protein expression, specifically for elongated/late
spermatids and round/early spermatids, remains a strenuous task. Properties of
the dataset, such as label correlations and label cardinality, can strongly
affect the uncertainty quantification and predictive performance of a Bayesian
deep learning algorithm in multi-label settings. There is no systematic study on
how and why the performance varies over different data properties; any such
study would be of great benefit in progressing multi-label algorithms.
Abstract. Convolutional neural networks have been used to achieve a
string of successes during recent years, but their lack of interpretability 
remains a serious issue. Adversarial examples are designed to deliber- 
ately fool neural networks into making any desired incorrect classifica- 
tion, potentially with very high certainty. Several defensive approaches 
increase robustness against adversarial attacks, demanding attacks of 
greater magnitude, which lead to visible artifacts. By considering human
visual perception, we compose a technique that allows such adversarial
attacks to be hidden in regions of high complexity, such that they are
imperceptible even to an astute observer. We carry out a user study on classifying
adversarially modified images to validate the perceptual quality of our
approach and find significant evidence for its concealment with regard
to human visual perception.


1 Introduction 


The use of convolutional neural networks has led to tremendous achievements 
since Krizhevsky et al. [1] presented AlexNet in 2012. Despite efforts to under- 
stand the inner workings of such neural networks, they mostly remain black boxes 
that are hard to interpret or explain. The issue was exacerbated in 2013 when
Szegedy et al. [2] showed that “adversarial examples” — images perturbed in such 
a way that they fool a neural network — prove that neural networks do not simply 
generalize correctly the way one might naively expect. Typically, such adversarial 
attacks change an input only slightly, but in an adversarial manner, such that 
humans do not regard the difference of the inputs as relevant, but machines do.
There are various types of attacks, such as one pixel attacks, attacks that work 
in the physical world, and attacks that produce inputs fooling several different 
neural networks without explicit knowledge of those networks [3-5]. 
Adversarial attacks are not strictly limited to convolutional neural networks. 
Even the simplest binary classifier partitions the entire input space into labeled 
regions, and where there are no training samples close by, the respective label 
can only be nonsensical with regards to the training data, in particular near 
decision boundaries. One explanation of the “problem” that convolutional neu- 
ral networks have is that they perform extraordinarily well in high-dimensional 
settings, where the training data only covers a very thin manifold, leaving a lot 
of “empty space” with ragged class regions. This creates a lot of room for an 
attacker to modify an input sample and move it away from the manifold on
which the network can make meaningful predictions, into regions with
nonsensical labels. Due to this, even adversarial attacks that simply blur an image,
without any specific target, can be successful [6]. There are further attempts at
explaining the origin of the phenomenon of adversarial examples, but so far, no
conclusive consensus has been established [7-10].

Fig. 1. Two adversarial attacks carried out using the Basic Iterative Method (first two
rows) and our Entropy-based Iterative Method (last two rows). The original image (a)
(and (g)) is correctly classified as umbrella but the modified images (b) and (h) are
classified as slug with a certainty greater than 99%. Note the visible artifacts caused
by the perturbation (c), shown here with maximized contrast. The perturbation (i)
does not lead to such artifacts. (d), (e), (f), (j), (k), and (l) are enlarged versions of
the marked regions in (a), (b), (c), (g), (h), and (i), respectively.

A number of defenses against adversarial attacks have been put forward, 
such as defensive distillation of trained networks [11], adversarial training [12], 
specific regularization [9], and statistical detection [13-16]. However, no defense 
succeeds in universally preventing adversarial attacks [17,18], and it is possible 
that the existence of such attacks is inherent in high-dimensional learning prob- 
lems [6]. Still, some of these defenses do result in more robust networks, where 
an adversary needs to apply larger modifications to inputs in order to
successfully create adversarial examples, which raises the question of how robust a
network can become and whether robustness is a property that needs to be balanced
with other desirable properties, such as the ability to generalize well [19] or a 
reasonable complexity of the network [20]. 

Strictly speaking, it is not entirely clear what defines an adversarial example 
as opposed to an incorrectly classified sample. Adversarial attacks are devised to 
change a given input minimally such that it is classified incorrectly — in the eyes 
of a human. While astonishing parallels between human visual information
processing and deep learning exist, as highlighted e.g. by Yamins and DiCarlo [21]
and Rajalingham et al. [22], the two disagree when presented with an adversarial
example. Experimental evidence has indicated that specific types of adversarial
attacks can be constructed that also deteriorate the decisions of humans, when 
they are allowed only limited time for their decision making [23]. Still, human 
vision relies on a number of fundamentally different principles when compared 
to deep neural networks: while machines process image information in parallel, 
humans actively explore scenes via saccadic moves, displaying unrivaled abilities 
for structure perception and grouping in visual scenes as formalized e.g. in the 
form of the Gestalt laws [24-27]. As a consequence, some attacks are perceptible 
by humans, as displayed in Fig. 1. Here, humans can detect a clear difference
between the original image and the modified one; in particular in very homoge- 
neous regions, attacks lead to structures and patterns which a human observer 
can recognize. We propose a simple method to address this issue and answer the 
following questions. How can we attack images using standard attack strategies, 
such that a human observer does not recognize a clear difference between the 
modified image and the original? How can we make use of the fundamentals of 
human visual perception to “hide” attacks such that an observer does not notice 
the changes? 

Several different strategies for performing adversarial attacks exist. For a 
multiclass classifier, the attack’s objective can be to have the classifier predict 
any label other than the correct one, in which case the attack is referred to as 
untargeted, or some specifically chosen label, in which case the attack is called 
targeted. The former corresponds to minimizing the likelihood of the original 
label being assigned; the latter to maximizing that of the target label. Moreover,
the classifier can be fooled into classifying the modified input with extremely high 
confidence, depending on the method employed. This, in particular, can however 
lead to visible artifacts in the resulting images (see Fig. 1). After looking at a 
number of examples, one can quickly learn to make out typical patterns that 
depend on the classifying neural network. In this work, we propose a method for 
changing this procedure such that this effect is avoided. 

For this purpose, we extend known techniques for adversarial attacks. A par- 
ticularly simple and fast method for attacking convolutional neural networks is 
the aptly named Fast Gradient Sign Method (FGSM) [4,7]. This method, in its 
original form, modifies an input image x along a linear approximation of the 
objective of the network. It is fast but limited to untargeted attacks. An exten- 
sion of FGSM, referred to as the Basic Iterative Method (BIM) [28], repeatedly 
adds small perturbations and allows targeted attacks. Moosavi-Dezfooli et al.
[29] linearize the classifier and compute smaller (with regard to the $\ell_p$ norm)
perturbations that result in untargeted attacks. Using more computationally
demanding optimizations, Carlini and Wagner [17] minimize the $\ell_0$, $\ell_2$, or
$\ell_\infty$ norm of a perturbation to achieve targeted attacks that are still
harder to detect. Su et al. [3] carry out attacks that change only a single pixel,
but these attacks are only possible for some input images and target labels.
Further methods exist that do not result in obvious artifacts, e.g. the Contrast
Reduction Attack [30], but these are again limited to untargeted attacks: the
input images are merely corrupted such that the classification changes. None of
the methods mentioned here regard human perception directly, even though they
all strive to find imperceptibly small perturbations. Schönherr et al. [31]
successfully do this within the domain of acoustics.

We rely on BIM as the method of choice for attacks based on images, because
it allows robust targeted attacks with results that are classified with arbitrarily
high certainty, while remaining easy to implement and efficient to execute. Its
drawbacks are the aforementioned visible artifacts. To remedy this issue, we 
will take a step back and consider human perception directly as part of the 
attack. In this work, we propose a straightforward, very effective modification 
to BIM that ensures targeted attacks are visually imperceptible, based on the 
observation that attacks do not need to be applied homogeneously across the 
input image and that humans struggle to notice artifacts in image regions of high 
local complexity. We hypothesize that such attacks, in particular, do not change 
saccades as severely as generic attacks, and so humans perceive the original image 
and the modified one as very similar — we confirm this hypothesis in Sect. 3 as 
part of a user study. 


2 Adversarial Attacks 


Recall the objective of a targeted adversarial attack. Given a classifying
convolutional neural network $f$, we want to modify an input $x$, such that the
network assigns a different label $f(x')$ to the modified input $x'$ than to the
original $x$, where the target label $f(x')$ can be chosen at will. At the same
time, $x'$ should be as similar to $x$ as possible, i.e. we want the modification
to be small. This results in the optimization problem:


$$\min \|x' - x\| \quad \text{such that} \quad f(x') = y \neq f(x), \tag{1}$$


where $y = f(x')$ is the target label of the attack. BIM finds such a small
perturbation $x' - x$ by iteratively adapting the input according to the update rule


$$x \leftarrow x - \epsilon \cdot \operatorname{sign}[\nabla_x J(x, y)] \tag{2}$$


until f assigns the label y to the modified input with the desired certainty, 
where the certainty is typically computed via the softmax over the activations 
of all class-wise outputs. $\operatorname{sign}[\nabla_x J(x, y)]$ denotes the sign of the gradient of the
objective function $J(x, y)$, and is computed efficiently via backpropagation; $\epsilon$
is the step size. The norm of the perturbation is not considered explicitly, but
because in each iteration the change is distributed evenly over all pixels/features
in $x$, its $\ell_\infty$-norm is minimized.
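
To make the update rule concrete, the following sketch runs BIM against a toy linear softmax classifier, for which the gradient of the cross-entropy objective is available in closed form. Everything here is an illustrative assumption rather than the setup used in the paper: the classifier, the function names, the parameter values, and the omission of any clipping of $x$ to a valid pixel range.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, b, x):
    """Class probabilities of a linear softmax model (toy stand-in for f)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bc
                    for row, bc in zip(W, b)])


def grad_loss(W, b, x, y):
    """Gradient w.r.t. x of the cross-entropy J(x, y) towards target label y."""
    p = predict(W, b, x)
    err = [pc - (1.0 if c == y else 0.0) for c, pc in enumerate(p)]
    return [sum(W[c][j] * err[c] for c in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def bim(W, b, x, y, eps=0.05, certainty=0.9, max_iterations=1000):
    """Iterate x <- x - eps * sign(grad J(x, y)) until p(y | x) >= certainty."""
    x = list(x)
    for it in range(max_iterations):
        if predict(W, b, x)[y] >= certainty:
            return x, it
        g = grad_loss(W, b, x, y)
        x = [xi - eps * sign(gi) for xi, gi in zip(x, g)]
    return x, max_iterations
```

Since every coordinate moves by exactly `eps` per iteration, the largest coordinate-wise change after k iterations is at most k times `eps`, which is the sense in which the perturbation stays small in the maximum norm.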


2.1 Localized Attacks 


The main technical observation, based on which we hide attacks, is the fact that 
one can weigh and apply attacks locally in a precise sense: During prediction, a 
convolutional neural network extracts features from an input image, condenses 
the information contained therein, and conflates it, in order to obtain its best 
guess for classification. Where exactly in an image a certain feature is located 
is of minor consequence compared to how strongly it is expressed [32,33]. As a 
result, we find that during BIM’s update, it is not strictly necessary to apply the 
computed perturbation evenly across the entire image. Instead, one may choose 
to leave parts of the image unchanged, or perturb some pixels more or less than 
others, i.e. one may localize the attack. This can be directly incorporated into 
Eq. (2) by setting an individual value for $\epsilon$ for every pixel.

For an input image $x \in [0,1]^{w \times h \times c}$ of width $w$ and height $h$ with $c$ color
channels, we formalize this by setting a strength map $\mathcal{E} \in [0,1]^{w \times h}$ that holds
an update magnitude for each pixel. Such a strength map can be interpreted as
a grayscale image where the brightness of a pixel corresponds to how strongly
the respective pixel in the input image is modified. The adaptation rule (2) of
BIM is changed to the update rule


$$x_{ijk} \leftarrow x_{ijk} - \epsilon \cdot \mathcal{E}_{ij} \cdot \operatorname{sign}[\nabla_x J(x, y)]_{ijk} \tag{3}$$


for all pixel values $(i, j, k)$. In order to be able to express the overall strength of
an attack, for a given strength map $\mathcal{E}$ of size $w$ by $h$, we call


$$\kappa(\mathcal{E}) = \frac{\sum_{(i,j) \in \overline{w} \times \overline{h}} \mathcal{E}_{i,j}}{w \cdot h} \tag{4}$$


the relative total strength of $\mathcal{E}$, where for $n \in \mathbb{N}$ we let $\overline{n} = \{1, \ldots, n\}$ denote the
set of natural numbers from 1 to $n$. In the special case where $\mathcal{E}$ only contains
either black or white pixels, $\kappa(\mathcal{E})$ is the ratio of white pixels, i.e. the number of
attacked pixels over the total number of pixels in the attacked image.

Fig. 2. Localized attacks with different relative total strengths. The strength maps
(d), (e), and (f), which are based on Perlin noise, scaled such that the relative total
strength is 0.43, 0.14, and 0.04, are used to create the adversarial examples in (a),
(b), and (c), respectively. In each case, the attacked image is classified as slug with a
certainty greater than 99%. The attacks took 14, 17, and 86 iterations. (g), (h), and
(i) are enlarged versions of the marked regions in (a), (b), and (c).

As long as the scope of the attack, i.e. $\kappa(\mathcal{E})$, remains large enough, adversarial
attacks can still be carried out successfully, if not as easily, with more iterations
required until the desired certainty is reached. This leads to the attacked pixels
being perturbed more, which in turn leads to even more pronounced artifacts.
Given a strength map $\mathcal{E}$, it can be modified to increase or decrease $\kappa(\mathcal{E})$ by
adjusting its brightness or by applying appropriate morphological operations.
See Fig. 2 for a demonstration that uses pseudo-random noise as a strength
map.
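
The localized update of Eq. (3) and the relative total strength of Eq. (4) can be sketched as follows. The nested-list image layout `x[i][j][k]`, the function names, and the assumption that the gradient is supplied by the caller are choices made for this illustration only.

```python
def relative_total_strength(strength):
    """kappa(E): mean of the strength map entries, each assumed to lie in [0, 1]."""
    rows, cols = len(strength), len(strength[0])
    return sum(sum(row) for row in strength) / (rows * cols)


def _sign(v):
    return (v > 0) - (v < 0)


def localized_step(x, grad, strength, eps):
    """One update of Eq. (3): each pixel (i, j) moves by eps * strength[i][j].

    x and grad are indexed as [i][j][k] (pixel position, then color channel);
    strength is indexed as [i][j] and shared across the channels of a pixel.
    """
    return [[[x[i][j][k] - eps * strength[i][j] * _sign(grad[i][j][k])
              for k in range(len(x[i][j]))]
             for j in range(len(x[i]))]
            for i in range(len(x))]
```

With a binary strength map, pixels whose entry is 0 are left untouched and `relative_total_strength` reduces to the fraction of attacked pixels.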


2.2 Entropy-Based Attacks 


The crucial component necessary for “hiding” adversarial attacks is choosing 
a strength map $\mathcal{E}$ that appropriately considers human perceptual biases. The
strength map essentially determines which “norm” is chosen in Eq. (1). If it 
differs from a uniform weighting, the norm considers different regions of the 
image differently. The choice of the norm is critical when discussing the visibility 
of adversarial attacks. Methods that explicitly minimize the $\ell_p$ norm of the
perturbation for some $p$ only “accidentally” lead to perturbations that are hard
to detect visually, since the $\ell_p$ norm does not actually resemble e.g. the human
visual focus for the specific image. We propose to instead make use of how 
humans perceive images and to carefully choose those pixels where the resulting 
artifacts will not be noticeable. 

Rather than trying to hide our attack in the background or “where an observer
might not care to look”, we focus on those regions where there is high
local complexity. This choice is based on the rationale that humans inspect images
in saccadic moves, and a focus mechanism guides how a human can process highly 
complex natural scenes efficiently in a limited amount of time. Visual interest 
serves as a selection mechanism, singling out relevant details and arriving at an 
optimized representation of the given stimuli [34]. We rely on the assumption 
that adversarial attacks remain hidden if they do not change this scheme. In 
particular, regions which do not attract focus in the original image should not 
increase their level of interest, while relevant parts can, as long as the adversarial 
attack is not adding additional relevant details to the original image. 

Due to its dependence on semantics, it is hard — if not impossible — to agnos- 
tically compute the magnitude of interest for specific regions of an image. Hence, 
we rely on a simple information theoretic proxy, which can be computed based 
on the visual information in a given image: the entropy in a local region. This 
simplification relies on the observation that regions of interest such as edges typ- 
ically have a higher entropy than homogeneous regions and the entropy serves 
as a measure for how much information is already contained in a region — that 
is, how much relative difference would be induced by additional changes in the 
region. 

Algorithmically, we compute the local entropy at every pixel in the input
image as follows: After discarding color, we bin the gray values, i.e. the
intensities, in the neighborhood of pixel $(i, j)$ such that $B_{i,j}$ contains the respective
occurrence ratios. The occurrence ratios can be interpreted as estimates of the
intensity probability in this neighborhood, hence the local entropy $S_{i,j}$ can be
calculated as the Shannon entropy


$$S_{i,j} = -\sum_{p \in B_{i,j}} p \log p. \tag{5}$$


Through this, we obtain a measure of local complexity for every pixel in the 
input image, and after adjusting the overall intensity, we use it as suggested 
above to scale the perturbation pixel-wise during BIM’s update. In other words, 
we set 


$$\mathcal{E} = \phi(S) \tag{6}$$


where $\phi$ is a nonlinear mapping, which adjusts the brightness. The choice of a
strength map based on the local entropy of an image allows us to perform an 
attack as straightforward as BIM, but localized, in such a way that it does not 
produce visible artifacts, as we will see in the following experiments. 
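
A minimal sketch of the entropy-based strength map is given below. The neighborhood shape (a square window clipped at the image border), the number of bins, the base-2 logarithm, and the use of a hard threshold as the mapping are assumptions for illustration; the function names are hypothetical.

```python
import math


def local_entropy(gray, radius=2, bins=16):
    """Shannon entropy (bits) of binned gray values around each pixel.

    gray is a 2D list of intensities in [0, 1]; the window is a square of
    side 2 * radius + 1, truncated at the image border.
    """
    rows, cols = len(gray), len(gray[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            counts = [0] * bins
            n = 0
            for a in range(max(0, i - radius), min(rows, i + radius + 1)):
                for c in range(max(0, j - radius), min(cols, j + radius + 1)):
                    counts[min(int(gray[a][c] * bins), bins - 1)] += 1
                    n += 1
            # Occurrence ratios count / n play the role of B_{i,j}.
            out[i][j] = -sum((k / n) * math.log2(k / n) for k in counts if k)
    return out


def binarize(entropy_map, threshold):
    """One possible mapping: attack only pixels whose entropy exceeds threshold."""
    return [[1.0 if s > threshold else 0.0 for s in row] for row in entropy_map]
```

Homogeneous regions yield an entropy of zero and are excluded from the attack, while pixels near edges or texture receive a non-zero strength.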

While we could attach our technique to any attack that relies on gradients, 
we use BIM because of the aforementioned advantages including simplicity, ver- 
satility, and robustness, but also because as the direct successor to FGSM we 
consider it the most typical attack at present. As a method of performing adver- 
sarial attacks, we refer to our method as the Entropy-based Iterative Method 
(EbIM). 


3 A Study of How Humans Perceive Adversarial 
Examples 


It is often claimed that adversarial attacks are imperceptible¹. While this can
be the case, there are many settings in which it does not necessarily hold
true, as can be seen in Fig. 1. When robust networks are considered and an
attack is expected to reliably and efficiently produce adversarial examples, vis- 
ible artifacts appear. This motivated us to consider human visual perception 
directly and thereby our method. To confirm that there are in fact differences 
in how adversarial examples produced by BIM and EbIM are perceived, we con- 
ducted a user study with 35 participants. 


¹ We do not want to single out any specific source for this claim, and it should not
necessarily be considered strictly false, because there is no commonly accepted rig- 
orous definition of what constitutes an adversarial example or an adversarial attack, 
just as it remains unclear how to best measure adversarial robustness. Whether an 
adversarial attack results in noticeable artifacts depends on a multitude of factors, 
such as the attacked model, the underlying data (distribution), the method of attack, 
and the target certainty. 
3.1 Generation of Adversarial Examples


To keep the course of the study manageable, so as not to bore our relatively small
number of participants, and still acquire statistically meaningful (i.e. with high
statistical power) and comparable results, we randomly selected only 20 labels
and 4 samples per label from the validation set of the ILSVRC 2012 classification
challenge [35], which gave us a total of 80 images. For each of these 80 images
we generated a targeted high-confidence adversarial example using BIM and
another one using EbIM, resulting in a total of 240 images. We set a fixed target
class and the target certainty to 0.99. We attacked the pretrained Inception v3
model [36] as provided by keras [37]. We set the parameters of BIM to $\epsilon = 1.0$,
stepsize = 0.004 and max_iterations = 1000. For EbIM, we binarized the
entropy mask with a threshold of 4.2. We chose these parameters such that the
algorithms can reliably generate targeted high certainty adversarial examples
across all images, without requiring expensive per-sample parameter searches.


3.2 Study Design 


For our study, we assembled the images in pairs according to three different 
conditions: 


(i) The original image versus itself. 
(ii) The original image versus the adversarial example generated by BIM. 
(iii) The original image versus the adversarial example generated by EbIM. 


This resulted in 240 pairs of images that were to be evaluated during the study. 

All image pairs were shown to each participant in a random order — we also 
randomized the positioning (left and right) of the two images in each pair. For 
each pair, the participant was asked to determine whether the two images were 
identical or different. If the participant thought that the images were identical 
they were to click on a button labeled “Identical” and otherwise on a button 
labeled “Different” — the ordering of the buttons was fixed for a given participant 
but randomized when they began the study. To facilitate completion of the study 
in a reasonable amount of time, each image pair was shown for 5 s only; the
participant was, however, able to wait as long as they wanted until clicking on 
a button, whereby they moved on to the next image pair. 


3.3 Hypotheses Tests 


Our hypothesis was that it would be more difficult to perceive the changes in the 
images generated by EbIM than by BIM. We therefore expect our participants 
to click “Identical” more often when seeing an adversarial example generated by 
EbIM than when seeing an adversarial example generated by BIM.

As a test statistic, we compute for each participant and for each of the three
conditions separately, the percentage of times they clicked on “Identical”. The
values can be interpreted as a mean if we encode “Identical” as 1 and “Different”
as 0. Hereinafter we refer to these mean values as $\mu_{\text{BIM}}$ and $\mu_{\text{EbIM}}$
(and $\mu_{\text{NONE}}$ for pairs of identical images). For each of
the three conditions, we provide a boxplot of the test statistics in Fig. 3; the
scores of EbIM are much higher than those of BIM, which indicates that it is in
fact much harder to perceive the modifications introduced by EbIM compared to
BIM. Furthermore, users almost always clicked on “Identical” when seeing two
identical images.

Fig. 3. Percentage of times users clicked on “Identical” when seeing two identical
images (condition (i), blue box), a BIM adversarial (condition (ii), orange box), or
an EbIM adversarial (condition (iii), green box). (Color figure online)

Finally, we can phrase our belief as a hypothesis test. We determine whether 
we can reject the following five hypotheses: 


(1) $H_0$: $\mu_{\text{BIM}} \geq \mu_{\text{EbIM}}$, i.e. attacks using BIM are as hard or harder to perceive
than EbIM.

(2) $H_0$: $\mu_{\text{BIM}} \geq 0.5$, i.e. whether attacks using BIM are easier or harder to
perceive than a random prediction.

(3) $H_0$: $\mu_{\text{EbIM}} \leq 0.5$, i.e. whether attacks using EbIM are easier or harder to
perceive than a random prediction.

(4) $H_0$: $\mu_{\text{BIM}} \geq \mu_{\text{NONE}}$, i.e. whether attacks using BIM are as easy or easier to
perceive than identical images.

(5) $H_0$: $\mu_{\text{EbIM}} \geq \mu_{\text{NONE}}$, i.e. whether attacks using EbIM are as easy or easier
to perceive than identical images.


We use a one-tailed t-test and the (non-parametric) Wilcoxon signed rank
test with a significance level $\alpha = 0.05$ in both tests. The cases (1), (4) and (5)
are tested as a paired test and the other two cases (2) and (3) as one sample 
tests. 

Because the t-test assumes that the mean difference is normally distributed,
we test for normality² by using the Shapiro-Wilk normality test. The
Shapiro-Wilk normality test computes a p-value of 0.425, so normality is not
rejected and we assume that the mean difference follows a normal distribution.
The resulting p-values are listed in Table 1; we can reject all null hypotheses
with very low p-values.


² Because we have 35 participants, we assume that normality approximately holds
because of the central limit theorem. 
Table 1. p-values of each hypothesis (columns) under each test (rows). We reject all
null hypotheses. 


Test Hyp. (1) Hyp. (2) Hyp. (3) | Hyp. (4) Hyp. (5) 
t-test 2.20 x 10718 | 1.03 x 10719 | 2.13 x 107° | 2.20 x 10718 | 2.20 x 10716 
Wilcoxon 1.28 x 1077 |9.10 x 1077 6.75 x 107° | 1.28 x 1077 | 1.28 x 1077 


In order to compute the power of the t-test, we compute the effect size by 
computing Cohen’s d. We find that $d \approx 2.29$, which is considered a huge effect
size [38]. The power of the one-tailed t-test is then approximately 1. 
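
How an effect size and a paired test statistic relate can be sketched as follows. The text does not state which variant of Cohen's d was used, so the version below (mean of the paired differences divided by their sample standard deviation) is an assumption, as are the function name and the example scores in the usage note.

```python
import math
import statistics


def paired_effect(a, b):
    """Cohen's d for paired samples and the matching paired t statistic.

    a and b hold one score per participant for two conditions, e.g. the
    fraction of "Identical" clicks under EbIM and under BIM.
    Returns (d, t, degrees_of_freedom).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation
    d = mean_d / sd_d                       # effect size of the differences
    t = mean_d / (sd_d / math.sqrt(n))      # equals d * sqrt(n)
    return d, t, n - 1
```

For instance, `paired_effect([0.9, 0.8, 0.85, 0.95], [0.3, 0.4, 0.35, 0.25])` uses made-up scores for four participants. Because the t statistic grows with the square root of the number of participants for a fixed d, a large effect size yields a very high power even with a few dozen participants.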

We have empirically shown that adversarial examples produced by EbIM are 
significantly harder to perceive than adversarial examples generated by BIM. 
Furthermore, adversarial examples produced by EbIM are not perceived as dif- 
fering from their respective originals. 


4 Discussion 


Adversarial attacks will remain a potential security risk on the one hand and 
an intriguing phenomenon that leads to insight into neural networks on the 
other. Their nature is difficult to pinpoint and it is hard to predict whether 
they constitute a problem that will be solved. To further the understanding of 
adversarial attacks and robustness against them, we have demonstrated two key 
points: 


— Adversarial attacks against convolutional neural networks can be carried out 
successfully even when they are localized. 

— By reasoning about human visual perception and carefully choosing areas of 
high complexity for an attack, we can ensure that the adversarial perturbation 
is barely perceptible, even to an astute observer who has learned to recognize 
typical patterns found in adversarial examples. 


This has allowed us to develop the Entropy-based Iterative Method (EbIM),
which performs adversarial attacks against convolutional neural networks that
are hard to detect visually even when their magnitude is considerable with
regard to an L_p-norm. It remains to be seen how current adversarial defenses
perform when confronted with entropy-based attacks, and whether robust
networks learn special kinds of features when trained adversarially using EbIM.

Through our user study we have made clear that not all adversarial attacks 
are imperceptible. We hope that this is only the start of considering human 
perception explicitly during the investigation of deep neural networks in general 
and adversarial attacks against them specifically. Ideally, this would lead to a 
concise definition of what constitutes an adversarial example. 


Abstract. To perform cluster analysis on graphs we utilize graph kernels,
the Weisfeiler-Lehman kernel in particular, to transform graphs into
a vector representation. Despite good results, these kernels have been
criticized in the literature for high dimensionality and high sensitivity,
so we propose an efficient subtree distance measure that is subsequently
used to enrich the vector representations and enables more sensitive
distance measurements. We demonstrate the usefulness in an application
where the graphs represent different source code snapshots, and a cluster
analysis of these snapshots provides the lecturer with an overview of the
overall performance of a group of students.


1 Motivation 


Graphs are a universal data structure and have become very popular over recent
years in various domains with structured data (e.g. protein function prediction,
drug toxicity prediction, malware detection, etc.). To apply existing clustering
or classification techniques to graphs, either a distance (or similarity) measure
is needed, or a transformation into a vector representation, for which most
clustering and classification algorithms were developed. In this paper we are
concerned with repeatedly clustering graphs to understand the evolution of
students' source code. As will be explained in Sect. 2, we settle on
Weisfeiler-Lehman (WL) graph kernels [9] to decompose the graph into subtrees and to
define a similarity function over the number of common substructures across
graphs. It has been criticized, however, that WL subtree kernels produce (a)
many different substructures, so that only a few substructures will be common
across graphs, which establishes (b) a tendency of each graph to be similar only
to itself. In this paper we propose to include the subtree similarity in an
efficient post-processing step to tackle both problems: We exploit the fact that
many of the substructures may be formally distinct but actually quite similar.
By enriching the vector representations we obtain positive effects for the
overall graph similarity.


Algorithm 1. WLSK(G, l_{i-1})

Require: graph G = (V, E), label function l_{i-1} : V → Σ*
Ensure: returns new label function l_i : V → Σ*

for v ∈ V do
    store node label l_{i-1}(v) in s
    for w ∈ V, (v, w) ∈ E in (some lexicographical) order of l_{i-1}(w) do
        append l_{i-1}(w) to s
    end for
    compress s ← h(s) by applying a hash function h
    assign new label to node v: l_i(v) ← s
end for
return l_i


2 Related Work 


2.1 Measuring Similarity Directly 


A common approach to compare graphs is to calculate the edit distance between
graphs F and G: the minimal number of steps to transform G to F. For the
special case of trees, these steps consist of node deletion, node insertion, and
node relabelling. A survey on tree edit distance can be found in [1]; an efficient
O(n^3) algorithm, n being the maximal number of nodes in F and G, is
proposed in [2]. To adapt a tree edit distance to a specific application, there are
approaches to learn appropriate cost parameters [6]. With general graphs, the
editing process becomes more complicated as additional operations need to be
considered (edge insertion and edge deletion). A survey on graph edit distance is
given in [3]. Its computation is exponential in the number of nodes and therefore
infeasible for large graphs.


2.2 Measuring Similarity Indirectly 


Instead of coping with the full graph, one may decompose the graph into a set
of smaller entities and compare these sets instead of the graphs. These entities
may be frequent subgraphs (e.g. [8]), walks (short paths), graphlets (e.g. [10]) or
subtrees (e.g. [9]). Many graph kernel approaches explicitly construct a vector
representation, where the i-th element indicates how often the i-th substructure
occurs in the graph. From this vector a kernel or similarity matrix may be
calculated. Recent approaches, such as subgraph2vec [5], use deep learning to
translate graphs into such a vector representation.
This section reviews in particular the construction of a WL subtree kernel
(following [9]), as it will be the foundation of the next section. The subtree kernel
transforms a graph into a vector, where a non-zero entry indicates the occurrence
of a specific subtree in the graph. The total number of dimensions is determined
by all subtrees that have been identified in the full set of graphs.

Given a graph G = (V, E), a label function l : V → Σ* yields for each node
v ∈ V a label over a finite alphabet Σ. The initial labels l_0(v) are provided
together with the graph G (original labels). A new label function l_i is obtained
by calling WLSK(G, l_{i-1}), which is shown in Algorithm 1: It constructs new
labels by concatenating all child labels deterministically (by processing children
in some lexicographic order). A series of n WLSK calls provides a sequence of
label functions l_0, ..., l_n, where a node label l_i(v) takes all children of v up to
depth i into account. A label l_i(v) may thus serve as a kind of fingerprint of the
neighbourhood of v (hashcode). Let L_i = {l_i^1, l_i^2, ..., l_i^{k_i}} = l_i(V) be the set of
all different l_i-labels in G. The final vector representation of a graph is obtained
from

\[
\varphi(G) = \left( \#l_0^1, \ldots, \#l_0^{k_0}, \; \ldots, \; \#l_n^1, \ldots, \#l_n^{k_n} \right)
\]

where #l_i^j denotes how many nodes received the label l_i^j. Originally this approach
was proposed as a test of isomorphism [11], as isomorphic graphs exhibit identical
substructures (labels).
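
The relabelling step of Algorithm 1 can be sketched as follows. This is a
minimal illustration, not the authors' implementation: the function name
`wlsk_iteration`, the adjacency-dictionary input and the use of consecutive
integers as "hashcodes" are assumptions, and the small graph is made up.

```python
def wlsk_iteration(adj, labels, next_id):
    """One WLSK relabelling step (sketch).

    adj:     dict node -> list of successor nodes (assumed input format)
    labels:  dict node -> current label l_{i-1}(v)
    next_id: first unused label ID; consecutive IDs stand in for the hash h
    """
    table = {}        # signature -> new label ID (the dictionary of this level)
    new_labels = {}
    for v in sorted(adj):
        # Node label followed by the successor labels in sorted order,
        # so identical subtrees always produce the same signature.
        signature = (labels[v], tuple(sorted(labels[w] for w in adj[v])))
        if signature not in table:
            table[signature] = next_id
            next_id += 1
        new_labels[v] = table[signature]
    return new_labels, table, next_id

# Hypothetical directed graph with initial labels from {0, 1}.
adj = {1: [2, 3], 2: [], 3: [4], 4: []}
l0 = {1: 0, 2: 1, 3: 0, 4: 1}
l1, table, next_id = wlsk_iteration(adj, l0, next_id=2)
print(l1)
```

Nodes 2 and 4 are both leaves with label 1 and therefore receive the same new
label, whereas nodes 1 and 3 share the label 0 but differ in their successors
and are separated after one iteration.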

Figure 1 shows an illustrative example. On the top left we have two graphs
G_1 and G_2 with nodes v_1–v_7 and v_8–v_14, resp. The (numeric) label is written in
the node, the node identifiers are shown in gray. The table next to the graphs
shows, for each node, how the new label s is constructed from the current node
label and its successors. For instance, node v_1 of G_1 has label 0 and successors
with labels 2, 0, 1. Algorithm 1 creates new labels by appending the node label
and the successor labels (in sorted order), which yields “0 : 0,1,2” for v_1. The
rightmost table shows a dictionary, where each new label (here: 0 : 0,1,2) gets
a fresh ID (here: 3). Algorithm 1 refers to this step as hashing the node label
into a new ID (or hashcode); we use consecutive numbers just for illustrative
purposes. Children need to be ordered deterministically to get the same hash for
identical subtrees. The new label l_1(v_1) = 3 thus encodes a subtree of depth 1
with root 0 and children 0, 1, 2. Once all new labels are determined (lower half
of Fig. 1) the nodes v_1 and v_8 still have the same label: l_1(v_1) = 3 = l_1(v_8),
because their subtrees of depth 1 are identical. After another WLSK iteration,
however, the subtrees of depth 2 are no longer identical for v_1 and v_8, so their
l_2-labels are no longer the same: l_2(v_1) = 11 ≠ 17 = l_2(v_8). The final vector
representation for G_1 and G_2 (after 2 iterations) consists of counts for each label
(from all depths):


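\[
\varphi(G_1) = (\underbrace{4,1,2}_{L_0},\; \underbrace{1,1,1,1,2,1,0,0}_{L_1},\; \underbrace{1,1,1,1,2,1,0,0,0,0}_{L_2})
\]
\[
\varphi(G_2) = (\underbrace{3,2,2}_{L_0},\; \underbrace{2,0,0,0,2,1,1,1}_{L_1},\; \underbrace{0,0,0,0,2,1,1,1,2,1}_{L_2})
\]

The three groups hold the L_0-, L_1- and L_2-label counts. The vector
representation φ(G) enables us to construct a kernel matrix or apply
standard clustering and classification directly.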


2.3 Discussion 


Fig. 1. Illustrative example of 2 WLSK iterations. Left: initial labels l_0, middle: l_1,
right: l_2


Measuring graph similarity indirectly is in general more efficient than direct
approaches. Among the kernel approaches it has been pointed out that with some
substructures, e.g. short paths (aka walks), many different graphs map to the
same point in the feature space (cf. [7]). Subtree kernels (and
in particular WLSK) have been reported to be efficient¹ and well-performing in
subsequent tasks (e.g. SVM classification). However, from the example in Fig. 1
we can also acknowledge the critique of the approach: Although G_2 has been
obtained from G_1 by removing v_4 and adding a single node only, the vector
representations are very different. Spotting differences early is good when
checking for isomorphic graphs, but may be less desirable for similarity
assessment (e.g. clustering). Despite the few changes, more than half of the
labels occur exclusively in only one of the graphs (13 entries out of 21 are zero
in one of the two graphs). Continuous (rather than integer) features may help, as
provided by some deep learning approaches, but deep learning requires a huge
amount of training data, which makes it unsuitable for datasets of moderate size.
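
The sparsity critique can be checked directly on the two example vectors from
Sect. 2.2. The sketch below copies both vectors from the text; the helper names
are hypothetical, and the plain dot product is used here as one possible kernel
on the count vectors (an assumption for illustration).

```python
# Count vectors of G_1 and G_2 after two WLSK iterations (copied from the text).
phi_g1 = [4, 1, 2,  1, 1, 1, 1, 2, 1, 0, 0,  1, 1, 1, 1, 2, 1, 0, 0, 0, 0]
phi_g2 = [3, 2, 2,  2, 0, 0, 0, 2, 1, 1, 1,  0, 0, 0, 0, 2, 1, 1, 1, 2, 1]

def linear_kernel(x, y):
    """Dot product of two count vectors (assumed kernel for this sketch)."""
    return sum(a * b for a, b in zip(x, y))

def exclusive_dimensions(x, y):
    """Number of labels that occur in exactly one of the two graphs."""
    return sum(1 for a, b in zip(x, y) if (a == 0) != (b == 0))

print(len(phi_g1), exclusive_dimensions(phi_g1, phi_g2))
print(linear_kernel(phi_g1, phi_g2))
```

Only the dimensions that are non-zero in both vectors contribute to the dot
product, which is why labels that occur in just one graph cannot add to the
similarity of the two graphs.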


3 Enriching WL Subtree Kernels 


Revisiting Fig. 1, node v_3 of G_1 and node v_10 of G_2 differ only by a missing node
labelled ‘1’. From the different l_1-hashcodes for both nodes (5 for v_3 and 3 for
v_10) we cannot conclude what they have in common. Secondly, node v_2 of G_1 and
v_9 of G_2 are similar in the sense that nodes labelled 0 and 2 can be reached, only
in G_1 there is an intermediate node v_4. If we accept that node pairs (v_2, v_9) and
(v_3, v_10) are somewhat similar, this should then positively affect the l_2-similarity
of v_1 and v_8, too. We want to take this kind of similarity into account without
sacrificing the efficiency of WLSK. Instead of integer features (subtree counts)
we introduce continuous features to better reflect a partial matching of subtrees.
We stick to the WLSK construction, but propose a post-processing step, which
replaces the zero entries in the vector representation. As many subtrees (with
different hashcodes) are in fact similar, we obtain highly correlating dimensions,
which are safe to remove; removing them reduces the dimensionality. We optionally
apply dimensionality reduction to arrive at a vector of moderate size.


¹ The only necessary data structure is a hash table that collects how often each node
label occurred.


3.1 Subtree Similarity 


Given a graph G = (V, E), let L_i = l_i(V) be the set of all hashcodes for
subtrees of depth i (cf. tables on the right of Fig. 1). The hashcodes compress the
newly constructed node labels, but no longer contain any information about
the subtree. So we track this information in tables: For all occurring hashcodes
h ∈ L_i, we denote the root node label by r_h ∈ L_{i-1} and the multiset of successor
labels by S_h ⊂ L_{i-1}. (Example: For h = 11 ∈ L_2 in Fig. 1 we have r_h = 3 and
S_h = {4,5,7}.)

Next we define a series of distance functions d_i : L_i × L_i → R to capture
the distance between subtree hashcodes of the same depth i. We start with a
distance d_0 for the original graph node labels. In the absence of any background
knowledge we use for the initial level

\[
d_0(h, h') = \begin{cases} 0 & \text{if } h = h' \\ 1 & \text{otherwise} \end{cases} \tag{1}
\]

but generally assume that some background information can be provided to
arrive at meaningful distances for the initial node labels.

For non-trivial subtrees (that is, i > 0) we recursively define distance
functions d_i(h, h'). It is natural to define the distance as the sum of distances between
root and child nodes. This requires assigning child nodes of h uniquely to child
nodes of h', which is provided by a bijective function f : S_h → S_{h'}:

\[
d_i(h, h') := \underbrace{d_{i-1}(r_h, r_{h'})}_{\text{root node distance}}
  + \underbrace{\min_{f \in B(S_h, S_{h'})} \sum_{k \in S_h} d_{i-1}(k, f(k))}_{\text{distance of best subtree alignment}} \tag{2}
\]

Here B(S,T) denotes the set of bijective functions f : S → T. The first term
measures the distance between the root node labels and the second term identifies
the minimal distance among all node assignments. Finding the assignment with
minimal distance is known as the assignment problem, which has well-known
solutions; we adopt the Munkres algorithm for this task [4].

We are likely to deal with unbalanced assignments, that is, different numbers
of children for h and h'. A bijective assignment requires |S_h| = |S_{h'}|, so we add
the necessary number of missing nodes (denoted by ⊥) to the smaller multiset.²


² More formally B(S,T) is the set of bijective functions f : S' → T' where |S'| = k =
|T'|, S ⊆ S', T ⊆ T', S' has k − |S| (and T' has k − |T|) additional ⊥ elements.

Fig. 2. Left: A priori distances d_0 between labels of L_0. Case (i): Assignment matrix for
d_1 distance of l_1(v_2) = 4 and l_1(v_9) = 9. Case (ii): Assignment matrix for d_1 distance of
v_3 ({0,2}) and v_10 ({0,1,2}). Right: Derived d_1-distances from case (i) and (ii). Case
(iii): Assignment matrix for d_2 distance of v_1 ({4,5,7}) and v_8 ({3,7,9}) (Color figure
online)


We extend the distance d_0 to the case of missing nodes, which corresponds to an
additional row/column in the d_0-matrix (see d_0 example matrix in Fig. 2 (left)).
Again, these ⊥-distances may be an arbitrary constant or specifically provided
for each label h ∈ L_0 using background knowledge. Then Eq. (2) extends
naturally to ⊥-values:

\[
d_i(h, \bot) = d_{i-1}(r_h, \bot) + \sum_{k \in S_h} d_{i-1}(k, \bot) \tag{3}
\]


Figure 2 shows an example. The leftmost table shows the d_0-distances between
original node labels (cf. Fig. 1: L_0 = {0,1,2}), including the case of a missing
label ⊥. For the sake of illustration we assume a distance of ½ for the label pair
(0,2). Consider the comparison of v_2 and v_9 for depth-1 subtrees: d_1(h, h') with
h = l_1(v_2), h' = l_1(v_9). Both root nodes are identical (r_h = r_{h'} = 0), but the
multisets of successors are not (S_h = {0,0}, S_{h'} = {0,2}). Matrix (i) shows the
distance matrix for the assignment problem: all nodes of h' (rows) have to be
assigned to a node of h (columns). As the child nodes represent L_0-hashcodes,
we take the distances from the d_0 table. An optimal assignment is marked in
red and we obtain a distance d_1(h, h') = 0 + (0 + ½) = ½. Matrix (ii) shows a
second example for the d_1 comparison of v_3 vs v_10: As v_10 has three children
but v_3 only two, we introduce one ⊥-element to obtain a square matrix. The
optimal assignment is shown in red, the d_1-distance becomes 1.0. Both examples
contribute two values to the d_1-distance matrix (fourth matrix), from which we may
then calculate, e.g., d_2(l_2(v_1), l_2(v_8)) = 0 + (½ + 1 + 0) = 1.5 (matrix (iii)).
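
The recursion of Eqs. (1)–(3) can be sketched in a few lines. This is an
illustration under stated assumptions, not the authors' code: the label table
`TABLE` is hypothetical (it mimics the structure of the example but does not
reproduce the IDs of Fig. 1), the ⊥-distance at level 0 is assumed to be the
constant 1, and the optimal assignment is found by brute force over all
permutations instead of the Munkres algorithm, which is only reasonable for
small numbers of children.

```python
from itertools import permutations

MISSING = None  # stands for the missing-node symbol ⊥

# Assumed level-0 distances: 0 if equal, 1 otherwise, except 1/2 for the pair (0, 2).
D0_SPECIAL = {frozenset((0, 2)): 0.5}

# Hypothetical subtree tables: hashcode -> (root label r_h, successor labels S_h).
TABLE = {
    "A": (0, (0, 0)), "B": (0, (0, 2)), "C": (0, (0, 1, 2)), "D": (1, ()),
    "X": ("C", ("A", "B", "D")), "Y": ("C", ("B", "C", "D")),
}

def dist(h, g):
    """Distance between two hashcodes of the same depth (or a hashcode and ⊥)."""
    if h == g:
        return 0.0
    if h is MISSING or g is MISSING:
        present = g if h is MISSING else h
        if present not in TABLE:          # level 0: assumed constant ⊥-distance
            return 1.0
        root, kids = TABLE[present]       # Eq. (3)
        return dist(root, MISSING) + sum(dist(k, MISSING) for k in kids)
    if h not in TABLE:                    # level 0, Eq. (1) plus special pairs
        return D0_SPECIAL.get(frozenset((h, g)), 1.0)
    (rh, sh), (rg, sg) = TABLE[h], TABLE[g]
    k = max(len(sh), len(sg))             # pad the smaller multiset with ⊥
    sh = sh + (MISSING,) * (k - len(sh))
    sg = sg + (MISSING,) * (k - len(sg))
    best = min(sum(dist(a, b) for a, b in zip(sh, perm))
               for perm in permutations(sg))
    return dist(rh, rg) + best            # Eq. (2)

print(dist("A", "B"), dist("B", "C"), dist("X", "Y"))
```

With these assumed tables the three calls mirror cases (i), (ii) and (iii) of
the example: ½ for the pair with one differing child, 1.0 for the pair that
needs one ⊥-element, and 1.5 for the depth-2 comparison built on top of them.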


3.2 Updating Vector Representations 


254 F. Hoppner and M. Jahnke


Fig. 3. Insertion of nodes to compensate side-effects of superfluous nodes. (Color figure
online)


Once the WLSK algorithm has been executed, we determine all d_i-distances from
the l_i-labels alone (without revisiting the graphs). Then we update the vector
representations of all graphs, the zero entries in particular. Suppose x is a vector
representation of G and x_h = 0 for some h ∈ L_i, which means that subtree h is
not present in G. Among the subtrees that do occur in G we can now find the
one most similar to h, say h' ∈ L_i (smallest distance d_i(h, h')), and replace x_h by

\[
x_h \leftarrow k(d_i(h, h')) \cdot x_{h'}
\]

where k : R^+ → [0,1] is a monotonically decreasing function that turns distances
into similarities with k(0) = 1. The multiplication with x_{h'} accounts for the fact
that h' may occur multiple times in G. We used k(d) = e^{-(d/δ)^2}, where δ is a
user-defined threshold.
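
The enrichment of zero entries can be sketched as follows, assuming the
Gaussian-shaped function k(d) = e^{-(d/δ)^2} named in the text. The function
names, the dictionary-based inputs and the toy numbers are assumptions for
illustration only.

```python
import math

def similarity(d, delta):
    """k(d) = exp(-(d/delta)^2): decreasing in d, with k(0) = 1."""
    return math.exp(-(d / delta) ** 2)

def enrich(x, dist, delta):
    """Replace zero entries of a count vector by similarity-weighted counts.

    x:    dict label -> count in graph G
    dist: dict (h, h2) -> precomputed distance; only same-depth pairs are present
    """
    enriched = dict(x)
    present = [h for h, count in x.items() if count > 0]
    for h, count in x.items():
        if count != 0:
            continue
        candidates = [(dist[(h, g)], g) for g in present if (h, g) in dist]
        if candidates:
            d, g = min(candidates, key=lambda pair: pair[0])  # closest subtree in G
            enriched[h] = similarity(d, delta) * x[g]
    return enriched

# Hypothetical graph vector: label 4 is absent, labels 3 and 5 are present.
x = {3: 2, 4: 0, 5: 1}
dist = {(4, 3): 1.0, (4, 5): 0.5}
enriched = enrich(x, dist, delta=1.0)
print(enriched)
```

Non-zero entries are left untouched, so the enriched vector agrees with the
standard WLSK vector wherever a subtree actually occurs in the graph.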


3.3 Compensating Superfluous Nodes 


We say v is a superfluous node if it is just a stopover on the way to yet another
node, but does not contribute to the graph structure itself, that is, if the in-
and out-degree of v is 1. In Fig. 1 the node v_4 in G_1 is such a superfluous node.
In some applications nodes with certain labels may occur occasionally, but do
not carry any important information. Their existence/absence should therefore
not affect the graph similarity too much.

The discussed distance measure can cope with such differences when
comparing, e.g., the subtree of v_2 with that of v_9. But if we consider v_4 as a
superfluous intermediate node, it brings another undesired effect: It may introduce
completely new subtrees which are not present in other graphs. In the example
of Fig. 1 the node v_4 introduces subtrees with hashcodes 6 (at depth 1) and 14
(at depth 2), which are not present in G_2. When measuring the similarity of G_1
and G_2, such subtrees make the graphs appear less similar.

We address such cases by considering the insertion of a superfluous node in
our distance calculation. Figure 3 shows the situation once more: To enrich the
vector representation of G_2 we seek a closest match for label h. According to
Sect. 3.1 we consider, amongst others, a node with label h' as a candidate.
With both nodes having a single child only, finding the optimal bijective
assignment f is trivial (f(k) = k') and Eq. (2) boils down to d_{i-1}(r_h, r_{h'}) + d_{i-1}(k, k').
Now we additionally consider the insertion of a superfluous node v_s with the
same label as v_4, as shown in Fig. 3 (red). Note that a hashcode l_i(v_s) for the
newly inserted node was not necessarily generated earlier. How would the
distance between the node v (here v_4) and v_s evaluate? According to (2) we have

\[
d_i(l_i(v), l_i(v_s)) = d_{i-1}(l_{i-1}(v), l_{i-1}(v_s)) + d_{i-1}(k, k')
\]

The second part consists of a single term because both nodes have a single child
only. Note that it does not depend on v_s. Substituting the first term repeatedly
by its definition eventually leads us to

\[
d_i(l_i(v), l_i(v_s)) = \underbrace{d_0(l_0(v), l_0(v_s))}_{\text{0 by construction}}
  + \sum_{j=0}^{i-1} d_j(l_j(k), l_j(k')) \tag{4}
\]

The level-0 distance to the newly inserted node is 0 by construction; however, we
replace it by a penalty term d_I(l_0(v)) to reflect the fact that we had to insert a
new node. As with d_0(·,·) we assume that d_I(·) can be derived meaningfully from
the application context: If, for instance, nodes with a certain label h are optional,
we choose a low insertion distance d_I(h) and may otherwise set d_I(h) = ∞ to
prevent undesired insertions.

We thus arrive at a distance d_i^*(h, h') for the insertion of a superfluous node

\[
d_i^*(h, h') = \begin{cases}
  \min\{d_I(r_h), d_I(r_{h'})\} + \sum_{j=0}^{i-1} d_j(l_j(k), l_j(k')) & \text{if } S_h = \{k\} \wedge S_{h'} = \{k'\} \\
  \infty & \text{otherwise}
\end{cases} \tag{5}
\]

which yields ∞ if the prerequisites of a superfluous node are not met and
considers node insertion on both sides (inner min-term). The original distance
(2) may then be replaced by min{d_i(h, h'), d_i^*(h, h')} to reflect the occurrence of
superfluous nodes appropriately. These changes can be handled during the
pre-calculation of the distance matrices; the vector enrichment remains unchanged.


3.4 Complexity 


Enriching the vector representations requires two steps: (1) The calculation of all
distance matrices d_i requires calculating Σ_i |L_i|^2 entries. For each entry we have
to solve an assignment problem, which is O(d? log d) where d is the maximal node 
degree. The method is therefore unattractive for highly connected graphs. But 
many applications with large graphs have a bounded node degree. (2) Secondly, 
the vector representations x of all n graphs need to be enriched. This takes
O(m_z · m_nz) for each graph, where m_z (resp. m_nz) is the number of zero
(resp. non-zero) entries in x: for each zero entry in x we have to find the most
similar non-zero entry. The number of all labels from all graphs (m = Σ_i |L_i|) is much
larger than the number of nodes in a single graph, whereas m_nz is bounded by
the number of nodes in a single graph. With m_nz ≪ m, we may consider m_nz as
a constant (max. no. of nodes) and arrive at O(n · m) for the vector enrichment.


Exercise: Write a function to count the number of entries in an integer array
having a 3 at the last digit.


public static int count3(int[] x){
  int count=0;
  int i=0;
  int zaehler=0;
  while (i < x.length) {
    zaehler=x[i] % 10;
    if (zaehler == 3) {
      ++count;
    }
  }


Fig. 4. Example of a source code snapshot and its graph representation. The student 
has not yet finished the solution at this stage/snapshot, the return statement is still 
missing. 


4 Application 


We demonstrate the usefulness of the proposed modification in an application
from computer science education. The increase in the number of CS students
over the last years calls for tools that help lecturers to assess the stage of
development of a whole group of students, rather than inspecting the solutions one
by one. Our dataset consists of editing streams from the students' source code
editor (for selected exercises of an introductory programming course using Java).
In our preliminary evaluation we have about 30-50 such streams per task. We
extract snapshots of the code whenever a student starts to edit a different code
line than before. (Many snapshots thus do not represent compilable code.) The
goal is to compare editing paths against each other, for instance, to identify the
most common paths or outliers. We replace the textual representation of the
source snapshot by a graph capturing the abstract syntax tree and the variable
usage, as can be seen in the example of Fig. 4. We want to cluster the
snapshots and to construct a new graph where nodes correspond to clusters (of code
snapshots) and edges indicate editing paths of students. For the experiments we
applied some preprocessing (e.g. variable renaming in the graph) and assigned
low insertion costs to expression- and declaration-nodes, because students may
phrase conditions quite differently. Our use case for superfluous nodes (Sect. 3.3)
is code blocks ({ }), which are optional if the code within the block consists
of a single statement only (e.g. the ++count in Fig. 4).


4.1 Effect on Distances 


Table 1. Effect of vector enrichment on distances.

Kernel depth d | δ | Standard vector f | Enriched vector f
2 | 3·d | 1.50 | 5.44
3 | 3·d | 0.99 | 2.93
4 | 3·d | 0.58 | 2.01
5 | 3·d | (9.91 − 8.52)/σ_w = 0.38 | (15.07 − 8.34)/σ_w = 1.48


To measure the effect of the enriched kernel we have manually subdivided a set of
snapshots into similar and dissimilar snapshots. In a clustering setting we want
the modification to carve out clusters more clearly. We therefore compare the
mean distance μ_w (and standard deviation σ_w) within the group of similar graphs
against the mean distance μ_b (and standard deviation σ_b) between both groups.
By the factor f we denote the size of the gap between both means in multiples of
the within-group standard deviation σ_w, that is, f = (μ_b − μ_w)/σ_w. The factor f
may be considered as a measure of separation between the cluster of similar
graphs and the remaining graphs. From Table 1 we find that the enriched vector
representation consistently yields higher values of f than the standard one.
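
The separation factor f is simple to compute from two lists of pairwise
distances. The sketch below assumes the population standard deviation for σ_w
(the text does not say which estimator was used) and uses made-up distances.

```python
from statistics import mean, pstdev

def separation_factor(within, between):
    """f = (mu_b - mu_w) / sigma_w for two lists of pairwise distances.

    Assumption: sigma_w is the population standard deviation of `within`.
    """
    return (mean(between) - mean(within)) / pstdev(within)

# Hypothetical distances: within the group of similar graphs / between both groups.
within = [3.0, 5.0, 3.0, 5.0]
between = [6.0, 8.0]
f = separation_factor(within, between)
print(f)
```

A value of f = 3 here means that the between-group mean lies three
within-group standard deviations above the within-group mean.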


4.2 Dimensionality 


New node labels are introduced for every new subtree, which leads to a
high-dimensional vector representation that has been identified as problematic in the
literature (Sect. 2.3). Enriching the vector representation can help to overcome
this problem, because labels with minor changes will receive similar (enriched)
entries. For instance, a dataset with 718 code snapshot graphs generated as
many as 5179 different subtree labels (depth 3). After enrichment we identified
the number of attributes that might be removed from the dataset because it
already contains a highly correlating attribute. This leads to a substantial reduction
in the number of columns: Depending on the Pearson correlation threshold of
0.9/0.95/0.99, as many as 77%/68%/55% of the attributes can be discarded.
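
One way to carry out such a correlation-based reduction is a greedy pass over
the columns, sketched below. The greedy order, the use of the absolute Pearson
coefficient and the function names are assumptions; the text does not specify
how redundant attributes were selected.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equally long columns (0.0 if one is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)

def prune_correlated(columns, threshold):
    """Keep a column only if it is not highly correlated with a column kept earlier."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical enriched attributes (one list per subtree label).
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
kept = prune_correlated(columns, threshold=0.95)
print(kept)
```

Column "b" is a multiple of "a" and is therefore discarded, whereas "c" is only
weakly correlated with "a" and is kept.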


4.3 Code Graph Clustering 


To reduce the dimensionality further, a principal component analysis (PCA)
may be applied. Figure 5 shows the scatter plot of the principal components
(PC) #2 against PC #1, #3 and #4 for the standard representation (top) and
the enriched vectors (bottom). The colors indicate cluster memberships from a
mean shift clustering over 4 principal components. Note that, by construction of
the dataset, we do not expect the source code snapshots to fall apart completely
in well separated clusters, because the data represents the evolution towards a
final solution and snapshots differ by incremental changes only. In the standard case
the data scatters more uniformly and is less structured (left; PC1 vs PC2), while
the enriched data shows two long-stretched clusters that reflect a somewhat
linear code evolution for two different approaches to solve the exercise, which
corresponds much better to our expectation. When taking an additional component
into account (PC3), the scatterplot in the middle (PC2 vs PC3) offers a clearer
structure for the enriched data (e.g. the separation of the curved red cluster at
the top) than the original data.


Fig. 5. Principal component #2 versus principal component #1, #3 and #4 for
standard (left) and enriched (right) vectors. (Color figure online)


Fig. 6. Snapshot evolution for a group of students: Nodes represent clusters, edges
represent snapshot transitions. (Color figure online)

Figure 6 shows how the clusters are used in the context of our application.
Each cluster (like those in Fig. 5, but for a different exercise) corresponds to a
node in this graph. Whenever a student changes the code and thereby moves to a
different cluster, a (directed) edge is inserted. The number of students who have
followed a path is written next to the edge. Clusters that have only one incoming
and one outgoing edge are not shown for the sake of brevity. The green color
indicates the degree of unit-test fulfillment. The node labels a : b(c|d) carry
information about the cluster id a, number of students b that came across this
node, number of students c (resp. d) who started (resp. ended) in this node. From
this example the lecturer can immediately recognize that 42 students start in
cluster #1, from where most students (25) transition to cluster #2 and 10 more
students reach the same cluster via cluster #4 as an intermediate step. Cluster
#2 does not yet correspond to a perfect solution, but only 12 students manage
to reach the green cluster #3 from cluster #2. Other clusters and edges have
much smaller numbers; they cover exotic solutions or trial-and-error approaches.
The graph provides a good overview of the students' performance as a group.
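
The construction of such a transition graph from per-student cluster sequences
can be sketched as follows. The input format (one list of cluster ids per
student, in editing order) and all names are assumptions for illustration; the
counts correspond to the quantities b, c and d in the node labels and to the
numbers written next to the edges.

```python
from collections import Counter

def transition_graph(paths):
    """Count, per edge and per node, how many students followed or visited it.

    paths: one list of cluster ids per student, in editing order (assumed format).
    """
    edges, visits, starts, ends = Counter(), Counter(), Counter(), Counter()
    for path in paths:
        if not path:
            continue
        # Collapse consecutive snapshots that stay in the same cluster.
        collapsed = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
        # Sets ensure that each student is counted at most once per edge/node.
        for edge in set(zip(collapsed, collapsed[1:])):
            edges[edge] += 1
        for cluster in set(collapsed):
            visits[cluster] += 1
        starts[collapsed[0]] += 1
        ends[collapsed[-1]] += 1
    return edges, visits, starts, ends

# Hypothetical cluster sequences of three students.
paths = [[1, 1, 2, 3], [1, 4, 2], [1, 2, 2]]
edges, visits, starts, ends = transition_graph(paths)
print(edges[(1, 2)], visits[2], starts[1], ends[2])
```

In this toy input two students move directly from cluster 1 to cluster 2, while
the third reaches cluster 2 via cluster 4.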


Enriched Weisfeiler-Lehman Kernel for Improved Graph Clustering 259


5 Conclusions


Weisfeiler-Lehman subtree kernels can be used to transform graphs into a mean- 
ingful vector representation, but suffer from high dimensionality and sparsity, 
such that the similarity assessment is limited. We overcome both problems by 
taking the subtree distances into account — which are simpler to assess than gen- 
eral tree distance, because only subtrees of equal depth need to be considered. 
Based on the subtree distance we enrich the zero entries of graph vectors and 
improve the similarity assessment. A removal of highly correlating attributes 
reduces the dimensionality considerably. The modifications turned out to be 
advantageous in a use case of source code snapshot clustering. 


Abstract. Agglomerative clustering methods have been widely used by
many research communities to cluster their data into hierarchical struc- 
tures. These structures ease data exploration and are understandable 
even for non-specialists. But these methods necessarily result in a tree, 
since, at each agglomeration step, two clusters have to be merged. This 
may bias the data analysis process if, for example, a cluster is almost 
equally attracted by two others. In this paper we propose a new method 
that allows clusters to overlap until a strong cluster attraction is reached, 
based on a density criterion. The resulting hierarchical structure, called 
a quasi-dendrogram, is represented as a directed acyclic graph and com- 
bines the advantages of hierarchies with the precision of a less arbitrary 
clustering. We validate our work with extensive experiments on real data 
sets and compare it with existing tree-based methods, using a new mea- 
sure of similarity between heterogeneous hierarchical structures. 


1 Introduction 


Agglomerative hierarchical clustering methods are widely used to analyze large 
amounts of data. These successful methods construct a dendrogram — a tree 
structure — that enables a natural exploration of data which is very suitable 
even for non-expert users. Various tools offer intuitive top-down or bottom-up 
exploration strategies, zoom-in and zoom-out operations, etc. 

Let us consider the following real-life scenario: a social science researcher 
would like to understand the structure of specific scientific domains based on a 
large corpus of publications, such as dblp or Wiley. A contemporary approach 
is to construct a word embedding [23] of the key terms in publications, that is, 
to map terms into a high-dimensional space such that terms frequently used in 
the same context appear close together in this space (for the sake of simplicity, 
we omit interesting issues such as preprocessing, polysemy, etc.). Identifying 
for example the denser regions in this space directly leads to insights on the 
key terms of Science. Moreover, a dendrogram of key terms built with an
agglomerative method is typically used [9,14] to organize terms into hierarchies.
This dendrogram (Fig. 1a) eases data exploration and is understandable even for
non-specialists of data science.
© The Author(s) 2020 


Despite its usefulness, the dendrogram structure might be limiting. Indeed,
any embedding of key terms has a limited precision, and key-term proximity is
a debatable question. For example, in Fig. 1a, we can see that the bioinformatics
key term is almost equally attracted by biology and computing, meaning that
these terms appear frequently together, but in different contexts (e.g. different
scientific conferences). Unfortunately, with classical agglomerative clustering, a
merging decision has to be made, even if the advantage of one cluster over another
is very small. Let us suppose that, arbitrarily, biology and bioinformatics are
merged. This may suggest to our analyst (not an expert in computer science) that
bioinformatics is part of biology, and its link to computing may only appear at
the root of the dendrogram. Clearly, an interesting part of the information is lost in
this process.

In this paper, our goal is to keep the advantages of hierarchies while
avoiding early cluster merges. Going back to the previous example, we would like
to provide two different clusters showing that bioinformatics is close to both
biology and computing. At a larger level of granularity, these clusters will still
collapse, showing that these terms belong to a broader community. This way,
we deviate from the strict notion of trees, and produce a directed acyclic graph
that we call a quasi-dendrogram (Fig. 1b).


Fig. 1. Dendrogram and quasi-dendrogram for the structure of Science. (a) A classical dendrogram, hiding the early relationship between bioinformatics and computing. (b) A quasi-dendrogram, preserving the relationships of bioinformatics.


Our contributions are the following: 


— We propose an agglomerative clustering method that produces a directed 
acyclic graph of clusters instead of a tree, called a quasi-dendrogram, 

— We define a density-based merging condition to identify these clusters, 

— We introduce a new similarity measure to compare our method with other, 
quasi-dendrogram or tree-based ones, 

— We show through extensive experiments on real and synthetic data that we 
obtain high quality results with respect to classical hierarchical clustering, 
with reasonable time and space complexity. 


The rest of the paper is organized as follows: Sect. 2 describes our proposed
overlapping hierarchical clustering framework¹. Section 3 details our experimental
evaluation. Section 4 presents the related works, while Sect. 5 concludes the
paper.

¹ Source code available at https://gitlab.inria.fr/ijeantet/ohc.


2 Overlapping Hierarchical Clustering 


2.1 Intuition and Basic Definitions 


In a nutshell, our method obtains clusters in a gradual agglomerative fashion
and in a precise way. At each step, when we increase the neighbourhood of the
clusters by including more interconnections, we consider the points that fall in
this connected neighbourhood and we take the decision to merge some of them
whenever they are connected enough to a cluster using a density criterion $\lambda$.
Taking interconnections into account may lead to overlapping clusters.

More precisely, we consider a set $V = \{X_1, \ldots, X_N\}$ of $N$ points in an
$n$-dimensional space, i.e. $X_i \in V \subset \mathbb{R}^n$ where $n > 1$ and $|V| = N$. In order to
explore this space in an iterative way, we consider points that are close up to a
limit distance $\delta > 0$. We define the $\delta$-neighbourhood graph of $V$ as follows:


Definition 1 ($\delta$-neighbourhood graph). Let $V \subset \mathbb{R}^n$ be a finite set of data
points and $E \subseteq V^2$ a set of pairs of elements of $V$, let $d$ be a metric on $\mathbb{R}^n$ and let
$\delta > 0$ be a positive number. The $\delta$-neighbourhood graph $G_\delta(V, E)$ is a graph with
vertices labelled with the data points in $V$, and where there is an edge $(X, Y) \in E$
between $X \in V$ and $Y \in V$ if and only if $d(X, Y) \leq \delta$.


Property 1. If $\delta = 0$ then the $\delta$-neighbourhood graph consists of isolated points,
while if $\delta = d_{max}$, where $d_{max}$ is the maximum distance between any two nodes
in $V$, then $G_\delta(V, E)$ is the complete graph on $V$.


Varying $\delta$ allows us to progressively extend the neighbourhood of the vectors
to form bigger and bigger clusters. Clusters will be formed according to the
density of a region of the graph.
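As an illustration, the construction of the $\delta$-neighbourhood graph can be sketched in a few lines of Python. This is only a sketch under assumptions that are not part of the text: the helper name `neighbourhood_graph` is hypothetical, points are plain tuples, the Euclidean distance stands in for the metric $d$, and an edge is created when $d(X, Y) \leq \delta$ so that Property 1 holds at $\delta = d_{max}$.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def neighbourhood_graph(points, delta, d=dist):
    """Edges (i, j), i < j, between points at distance at most delta.

    Hypothetical helper: `d` can be any metric on the points.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if d(points[i], points[j]) <= delta
    }


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
d_max = max(dist(p, q) for p, q in combinations(points, 2))

isolated = neighbourhood_graph(points, 0.0)    # no edge at delta = 0
partial = neighbourhood_graph(points, 1.0)     # only the closest pair is linked
complete = neighbourhood_graph(points, d_max)  # every pair is linked
```

Increasing `delta` from 0 to `d_max` only ever adds edges, which is what lets the clusters grow progressively.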


Definition 2 (Density). The density [16] $\mathrm{dens}(G)$ of a graph $G(V, E)$ is given
by the ratio of the number of edges of $G$ to the number of edges of $G$ if it were
a complete graph, that is, $\mathrm{dens}(G) = \frac{2|E|}{|V|(|V| - 1)}$. If $|V| = 1$, $\mathrm{dens}(G) = 1$.


A cluster is simply defined as a subset of the nodes of the graph and its 
density is defined as the density of the corresponding subgraph. 
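The density of a cluster can be computed directly from this definition. The following minimal sketch assumes an undirected graph stored as a set of index pairs; the function name `density` and this edge representation are illustrative choices, not taken from the original implementation.

```python
def density(cluster, edges):
    """Density of the subgraph induced by a non-empty set of vertices."""
    n = len(cluster)
    if n == 1:
        return 1.0  # convention of Definition 2 for a single vertex
    # Count the edges whose two endpoints both lie in the cluster.
    inside = sum(1 for u, v in edges if u in cluster and v in cluster)
    return 2 * inside / (n * (n - 1))


edges = {(0, 1), (1, 2)}                   # a path 0 - 1 - 2
path_density = density({0, 1, 2}, edges)   # 2 edges out of 3 possible
pair_density = density({0, 1}, edges)      # a single edge is a clique
```

A clique always has density 1, while adding a loosely connected vertex lowers the density of the cluster.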


2.2 Computing Hierarchies with Overlaps 


Our algorithm, called OHC, computes a hierarchy of clusters that we can identify 
in the data. We call the generated structure a quasi-dendrogram and it is defined 
as follows. 


Definition 3 (Quasi-dendrogram). A quasi-dendrogram is a hierarchical
structure, represented as a directed acyclic graph, where the nodes are labelled
with a set of data points, the clusters, such that:

- The leaves (i.e. the nodes with 0 in-degree) correspond to the singletons, i.e.
contain a unique data point. The level of the leaf nodes is 0.

- There is only one root node (node with 0 out-degree) that corresponds to the
set of all the data points.

- Each node (except the root node) has one or more parent nodes. The parent
relationship corresponds to inclusion of the corresponding clusters.

- The nodes at a level $\delta$ represent a set of (potentially overlapping) clusters
that is a cover of all the data points. Also, for each pair of points of a given
cluster, there exists a path between them, through points of this cluster, whose
consecutive points are at a distance of at most $\delta$. In other terms, a node contains
a part of a connected subgraph of the $\delta$-neighbourhood graph.


The OHC method works as presented in Algorithm 1. We first compute the
distance matrix of the data points (l. 3). We chose the cosine distance, widely used
in NLP. Then we construct and maintain the $\delta$-neighbourhood graph $G_\delta(V, E)$,
starting from $\delta = 0$ (l. 4).

We also initialize the set of clusters, i.e. the leaves of our quasi-dendrogram,
with the individual data points (l. 4). At each iteration, we increase $\delta$ (l. 6) and
consider the newly added links to the graph (l. 8) and the impacted clusters (l. 9).
We extend these clusters by integrating the most linked neighbour vertices if
the density does not change by more than a given threshold $\lambda$ (l. 10-15). We remove
all the clusters included in these extended clusters (l. 16) and add the new set of
clusters to the hierarchy as a new level (l. 18). We stop when all the points are in
the same cluster, which means that we have reached the root of the quasi-dendrogram.

To improve the efficiency of this algorithm, we also use dynamic programming
to avoid recomputing information related to the clusters, such as their density and
the list of their neighbour vertices. This leads to significant improvements in the
execution time of the algorithm. We discuss this further in Sect. 3.3.


Property 2 ($\lambda = 0$). When $\lambda = 0$, each level $\delta_i$ of a quasi-dendrogram contains
exactly the cliques (complete subgraphs) of the $\delta_i$-neighbourhood graph $G_{\delta_i}$.


Property 3 ($\lambda = 1$). When $\lambda = 1$, each level $\delta_i$ of a quasi-dendrogram contains
exactly the connected subgraphs of the $\delta_i$-neighbourhood graph $G_{\delta_i}$.
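To make the procedure more concrete, here is a small Python sketch of one possible reading of it. It is not the authors' implementation (the published source code is the reference) and it simplifies several points, all of which are assumptions: the Euclidean distance replaces the cosine distance, all links of equal length are added before the clusters are updated, "most linked" is read as "all outside vertices with the maximal number of links to the cluster", and no caching of densities is done. The names `grow` and `ohc` are hypothetical.

```python
from itertools import combinations
from math import dist


def density(cluster, adj):
    """Edge density of the subgraph induced by `cluster` (Definition 2)."""
    n = len(cluster)
    if n == 1:
        return 1.0
    edges = sum(len(adj[u] & cluster) for u in cluster) // 2
    return 2 * edges / (n * (n - 1))


def grow(cluster, adj, lam):
    """Extend `cluster` with its most linked neighbours while each extension
    changes the density by at most `lam` (assumed reading of the merge rule)."""
    current = cluster
    while True:
        links = {}
        for u in current:
            for v in adj[u] - current:
                links[v] = links.get(v, 0) + 1
        if not links:
            return current
        best = max(links.values())
        candidate = current | {v for v, k in links.items() if k == best}
        if abs(density(candidate, adj) - density(current, adj)) > lam:
            return current
        current = candidate


def ohc(points, lam, d=dist):
    """Return the levels [(delta, clusters), ...], leaves first."""
    n = len(points)
    pairs = sorted((d(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    adj = {i: set() for i in range(n)}
    clusters = [frozenset([i]) for i in range(n)]
    levels = [(0.0, list(clusters))]
    k = 0
    while len(clusters) > 1 and k < len(pairs):
        delta, touched = pairs[k][0], set()
        while k < len(pairs) and pairs[k][0] == delta:  # new links at delta
            _, i, j = pairs[k]
            adj[i].add(j)
            adj[j].add(i)
            touched.update((i, j))
            k += 1
        grown = {}
        for c in clusters:
            if c & touched:  # impacted cluster
                g = grow(c, adj, lam)
                if g != c:
                    grown[c] = g
        new = set(grown.values())
        # Drop the clusters that grew or that are covered by a new cluster.
        clusters = [c for c in clusters
                    if c not in grown and not any(c <= s for s in new)]
        clusters.extend(new)
        levels.append((delta, list(clusters)))
    return levels


points = [(0.0,), (1.0,), (3.0,), (7.0,)]
strict = ohc(points, lam=0.0)  # overlapping cliques, e.g. {0, 1} and {1, 2}
loose = ohc(points, lam=1.0)   # connected components on this input
```

On these four points, the level built at distance 2 contains the overlapping clusters {0, 1} and {1, 2} for `lam=0.0`, whereas `lam=1.0` merges them into {0, 1, 2}, in line with the behaviour described by Properties 2 and 3 on this small input.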


3 Experimental Evaluation 


3.1 Experimental Methodology 


Tests: The tests we performed focused on the quality of the hierarchical
structures produced by our algorithm. To measure this quality we used the classical
hierarchy produced by SLINK, an optimal single-linkage clustering algorithm
proposed by Sibson [28], as a baseline. Our goal was to study the behaviour
of the merging criterion parameter $\lambda$ that we introduced, as well as its influence
on the execution time, and to verify whether for $\lambda = 1$ we experimentally obtain the
same hierarchy as SLINK (Property 3) and hence observe the conservativeness
of our algorithm. We also compared our method to other agglomerative
methods such as the Ward variant [29] and HDBSCAN* [8]. To compare such
structures we needed to create a new similarity measure, which is described in
Sect. 3.2.


Algorithm 1. Overlapping Hierarchical Clustering (OHC)

 1: Input:
      - $V = \{x_1, \ldots, x_N\}$, $N$ data points.
      - $\lambda \geq 0$, a merging density threshold.
 2: Output: quasi-dendrogram $H$.
 3: Preprocessing: obtain $\Delta = (\delta_1, \ldots, \delta_m)$, the distances between data points in increasing order.
 4: Initialization:
      - Create the graph $G(V, E_0 = \emptyset)$.
      - Set a list of clusters $C = [\{x_1\}, \ldots, \{x_N\}]$.
      - Add the list of clusters to the level 0 of $H$.
 5: $i = 1$.
 6: while $\#C > 1$ and $i \leq m$ do
 7:   for each pair $(u, v) \in V^2$ such that $d(u, v) = \delta_i$ do
 8:     Add $(u, v)$ to $E_{\delta_{i-1}}$.
 9:     Determine the impacted clusters $C_{imp}$ of $C$ containing either $u$ or $v$.
10:     for each impacted cluster $C_{imp_j} \in C_{imp}$ do
11:       Look for the points $\{p_1, \ldots, p_k\}$ that are the most linked to $C_{imp_j}$ in $G_{\delta_i}$.
12:       Compute the density $\mathrm{dens}(S_j)$ of the subgraph $S_j = C_{imp_j} \cup \{p_1, \ldots, p_k\}$.
13:       if $S_j \neq C_{imp_j}$ and $|\mathrm{dens}(S_j) - \mathrm{dens}(C_{imp_j})| \leq \lambda$ then
14:         Continue to add the most linked neighbours to $S_j$ the same way if possible.
15:         When $S_j$ stops growing, remove $C_{imp_j}$ from the list of clusters $C$ and add $S_j$ to the list of new clusters $C_{new}$.
16:   Remove all clusters of $C$ included in one of the clusters of $C_{new}$.
17:   Concatenate $C_{new}$ to $C$.
18:   Add the list of clusters to the level $\delta_i$ of $H$.
19:   $i = i + 1$.
20: return $H$


Datasets: To get a partial view of the scalability of our algorithm while avoiding
overly long running times, we had to limit the size of the datasets to a few thousand
vectors. To be able to compare the results, we ran the tests on datasets of the same
size, which we fixed to 1000 vectors.


- The first dataset is composed of 1000 randomly generated 2-dimensional
points.

- To test the algorithm on real data and in our motivating scenario, the second
dataset was created from the Wiley collection via their API². We extracted
the titles and abstracts of the scientific papers and trained a word embedding
model on the data of a given period of time by using the classical SGNS
algorithm from Mikolov et al. [22], following the recommendation of Levy et al.
[20]. We set the vocabulary size to only 1000 key words per year even though
this dataset allows us to extract up to 50000 of them. This word embedding
algorithm created 1000 300-dimensional vectors for each year over 20 years.


Experimental Setting: All our experiments were run on an Intel Xeon 5 Core
1.4GHz, running MacOS 10.2 on an SSD hard drive. Our code is developed with
Python 3.5 and the visualization part was done in a Jupyter Notebook. We used
the SLINK and Ward implementations from the scikit-learn Python package and
the HDBSCAN* implementation of McInnes et al. [21].

² https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining.


3.2 A Hierarchy Similarity Measure 


As there is no ground truth on the hierarchy of the data we used, we need a similarity
measure to compare the hierarchical structures produced by hierarchical
clustering algorithms. The goal is to compare not only the topology but also the
content of the nodes of the structure. However, to our knowledge there is very
little in the literature about hierarchy comparison, especially when the structure
is similar to a DAG or a quasi-dendrogram. Fowlkes and Mallows [19] defined a
similarity measure per level, and the new similarity function we propose is based
on the same principle. First we construct a similarity between two given levels
of the hierarchies, and then we extend it to the global structures by exploring
all the existing levels.


Level Similarity: Given two hierarchies $h_1$ and $h_2$ and a cardinality $i$, we
assume that it is possible to identify a set $l_1$ (resp. $l_2$) of $i$ clusters for a given
level of hierarchy $h_1$ (resp. $h_2$). Then, to measure the similarity between $l_1$ and
$l_2$, we take the maximal Jaccard similarity between one cluster of $l_1$ and every
cluster of $l_2$. The average of these similarities, one for each cluster of $l_1$,
gives us the similarity between the two sets. If we consider the similarity matrix
of $h_1$ and $h_2$ with a cluster of $l_1$ for each row, a cluster of $l_2$ for each column and
the Jaccard similarity between each pair of clusters at the respective coordinates
in the matrix, we can compute the similarity between $l_1$ and $l_2$ by taking the
average of the maximal value for each row. Hence, the similarity function between
two sets of clusters $l_1$, $l_2$ is defined as:


$$\mathrm{sim}_l(l_1, l_2) = \mathrm{mean}\{\max\{J(c_1, c_2) \mid c_2 \in l_2\} \mid c_1 \in l_1\} \tag{1}$$


where J is the Jaccard similarity function. 

However, taking the maximal value of each row shows how the clusters of 
the first set are represented in the second. If we take the maximal value of 
each column we will see the opposite, i.e. how the second set is represented in 
the first set. Hence with this definition the similarity might not be symmetrical 
so we propose this corrected similarity measure that shows how both sets are 
represented in the other one: 


$$\mathrm{sim}^*_l(l_1, l_2) = \mathrm{mean}(\mathrm{sim}_l(l_1, l_2), \mathrm{sim}_l(l_2, l_1)) \tag{2}$$
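The two level similarities translate almost literally into code. The sketch below represents a level as a list of Python sets; the function names `jaccard`, `sim_one_way` and `sim_level` are illustrative and not taken from the original implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def sim_one_way(l1, l2):
    """Eq. (1): mean over the clusters of l1 of their best match in l2."""
    return sum(max(jaccard(c1, c2) for c2 in l2) for c1 in l1) / len(l1)


def sim_level(l1, l2):
    """Eq. (2): symmetric version, averaging both directions."""
    return (sim_one_way(l1, l2) + sim_one_way(l2, l1)) / 2


l1 = [{1, 2, 3}, {3, 4}]   # two overlapping clusters
l2 = [{1}, {2, 3, 4}]      # a partition of the same points
forward = sim_one_way(l1, l2)
backward = sim_one_way(l2, l1)
both = sim_level(l1, l2)
```

On this toy input the one-way measure is not symmetric (7/12 in one direction, 1/2 in the other), which is exactly why the corrected measure averages both directions.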


Complete Similarity: Now that we can compare two levels of the hierarchical
structures, we can simply average the similarity over all corresponding levels
of the same size. For classical dendrograms, each level has a distinct number of
clusters so identification of levels is easy. Conversely, our quasi-dendrograms may
have several distinct levels (pseudo-levels) with the same number of clusters. If so,
we need to find the best similarity between these pseudo-levels. For a given level
(i.e. number of clusters), we want to build a matching $M$ that maps each pseudo-level
$l_1^1, l_1^2, \ldots$ of $h_1$ to at least one pseudo-level $l_2^1, l_2^2, \ldots$ of $h_2$ and conversely (see
Fig. 2). This matching $M$ should maximize the similarity between pseudo-levels
while preserving their hierarchical relationship. That is, for $a, b, c, d$ representing
the heights of pseudo-levels in the hierarchies, if $(l_1^a, l_2^c) \in M$ and $(l_1^b, l_2^d) \in M$,
then $(b > a \Rightarrow d > c)$ or $(b < a \Rightarrow d < c)$ (no “crossings” in $M$).

Fig. 2. Computing the similarity between two quasi-dendrograms $h_1$ and $h_2$ for levels having the same number of clusters.

To produce this mapping, our simple algorithm is the following. We initialize
$M$ and two pointers with the two highest pseudo-levels (step 1 of Fig. 2). At each
step, for each hierarchy, we consider the current pointers and their children, and
compute all their similarities (step 2). We then add the pseudo-levels with maximal
similarity to $M$. Whenever a child is chosen, the respective pointer advances, and at
each step, at least one pointer advances. Once the pseudo-levels have been consumed
on one side, ending with $l$, we can finish the process by adding $(l', l)$ to $M$
for all remaining pseudo-levels $l'$ on the other side. The example in Fig. 2
illustrates these steps.

Finally, from (2) we define the similarity between two hierarchies as


$$\mathrm{sim}(h_1, h_2) = \mathrm{mean}\{\mathrm{sim}^*_l(l_1, l_2) \mid (l_1, l_2) \in (h_1, h_2) \text{ and } (l_1, l_2) \in M\}. \tag{3}$$
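For classical dendrograms, where each number of clusters identifies a single level, the matching is trivial and the global similarity reduces to an average over the shared cluster counts. The sketch below covers only this simple case; the pseudo-level matching for quasi-dendrograms is not implemented. Representing a hierarchy as a dictionary from number of clusters to a list of sets is an assumption made for illustration, as is the name `sim_hierarchies`.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def sim_level(l1, l2):
    """Symmetric level similarity, following Eq. (2)."""
    def one_way(x, y):
        return sum(max(jaccard(c1, c2) for c2 in y) for c1 in x) / len(x)
    return (one_way(l1, l2) + one_way(l2, l1)) / 2


def sim_hierarchies(h1, h2):
    """Mean level similarity over the cluster counts present in both inputs.

    Simplified case only: one level per number of clusters, so the matching
    pairs levels of equal cardinality.
    """
    shared = sorted(set(h1) & set(h2))
    return sum(sim_level(h1[k], h2[k]) for k in shared) / len(shared)


# Two dendrograms over the points {1, 2, 3}, keyed by number of clusters.
h1 = {1: [{1, 2, 3}], 2: [{1, 2}, {3}], 3: [{1}, {2}, {3}]}
h2 = {1: [{1, 2, 3}], 2: [{1}, {2, 3}], 3: [{1}, {2}, {3}]}
score = sim_hierarchies(h1, h2)
```

The two toy dendrograms agree on the root and on the leaves and differ only on the two-cluster level, which gives an overall similarity of 5/6.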


3.3 Experimental Results 


Expressiveness: With the following small example we would like to illustrate
the expressiveness of our algorithm compared to classical hierarchical clustering
algorithms such as SLINK. On the hand-built example shown in Fig. 3a we can
clearly distinguish two groups of points, {A, B, C, D, E} and {G, H, I, J, K}, and
two points that we can consider as noise, F and L. Due to the chaining effect, we
expect the SLINK algorithm to merge the two sets of points early in the
hierarchy, while we would like to prevent this by allowing some cluster overlaps.

Figure 3b shows the dendrogram computed by SLINK and we can see as 
expected that when F merges with the cluster formed by {A, B,C, D, E} the 
next step is to merge this new cluster with {G, H, I, J, K}. 

Fig. 3. A hand-built example (a) and its SLINK dendrogram (b).

On the contrary, in Fig. 4, which presents the hierarchy built with our method
for a specific merging criterion, we can see an example of the diamond shape that
is specific to our quasi-dendrogram. For simplicity the view here slightly differs
from the quasi-dendrogram definition, as we used dashed arrows to represent
the provenance of some elements of a cluster instead of going further down in
the hierarchy to have a perfect inclusion and respect the lattice-like structure. The
merge between the clusters {A, B, C, D, E} and {G, H, I, J, K} is delayed to
the very last moment and the point F belongs to both clusters instead
of forcing them to merge. Also, depending on the merging criterion, we obtain
different hierarchical structures by merging some clusters earlier or later.


Merging Criterion: As we can see in Fig. 5b, when the merging criterion increases
we obtain a hierarchy more and more similar to the one produced by the classical
SLINK algorithm, until we obtain exactly the same hierarchy for a merging criterion of 1.
Knowing this, it is also normal that the similarity between OHC and Ward
(resp. HDBSCAN*) hierarchies converges to the similarity between SLINK and
Ward (resp. HDBSCAN*) hierarchies. However, we can notice that the OHC and
Ward hierarchies are the most similar for a merging criterion smaller than 1.


Fig. 4. OHC quasi-dendrogram obtained from the hand-built example in Fig. 3a for $\lambda = 0.2$.

Fig. 5. Study of the merging criterion.


Execution Time: We observe that when the merging criterion increases the
execution time decreases. This is because when the merging criterion
increases we are more likely to completely merge clusters, so we reach the
top of the hierarchy faster. This means fewer levels and fewer overlapping clusters,
and hence less computation. However, in this case we have the same drawback of the
chaining effect as single-linkage clustering, which we wanted to avoid. Even if it was not the
objective of this work, we set $\lambda = 0.1$, as it is an interesting value according
to the study of the merging criterion (Fig. 5a), to observe the evolution of the
execution time (Fig.5a). The trend gives a function in O(n?*°) so to speed 
up the process and scale up our algorithm it is possible to precompute a set of
possibly overlapping clusters over a given $\delta$-neighbourhood graph with a classical
method, for instance CLIQUE, and build the OHC hierarchy on top of that.
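A trend of this kind is commonly estimated with a least-squares fit in log-log scale, where the slope is the exponent of the power law. The sketch below shows one way to do this with the standard library only; the function name `power_law_exponent` and the timings are hypothetical and are not measurements from the experiments reported here.

```python
from math import log


def power_law_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [log(n) for n in sizes]
    ys = [log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic timings that follow t = 3e-6 * n^2 exactly (illustration only).
sizes = [100, 200, 400, 800]
times = [3e-6 * n ** 2 for n in sizes]
exponent = power_law_exponent(sizes, times)
```

With real measurements the points do not fall exactly on a line, so the fitted exponent should be read as an empirical trend rather than as a complexity bound.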


4 Related Work 


Our goal is to group together data points represented as vectors in $\mathbb{R}^n$. For our
motivating application domain of understanding the structure of scientific fields,
it is important to construct structures (i) that are hierarchical, (ii) that allow
overlaps between the identified groups of vectors and (iii) whose groups (clusters)
are related to dense areas of the data. There are a number of other application
domains where obtaining a structure with these properties is important. In the
following, we relate our work to relevant literature.


Hierarchical Clustering: There exist two kinds of hierarchical clustering.
Divisive methods follow a top-down strategy while agglomerative techniques
compute the hierarchy in a bottom-up fashion. The result is the well-known dendrogram
structure [1]. One of the oldest methods is single-linkage clustering,
which first appeared in the work of Florek et al. [18]. It was improved many times
over the years, up to the optimal algorithm named SLINK proposed by Sibson
[28]. However, the commonly cited drawback of single-linkage clustering is
that it is not robust to noise and suffers from chaining effects (spurious points
merging clusters prematurely). This led to the invention of many variants with their own
advantages and disadvantages. In the NLP world we have for instance Brown
clustering [7] and its generalized version [13]. The drawback of choosing the number
of clusters beforehand, present in the original Brown clustering, is corrected
in the generalized version. Researchers also tried to address the chaining
effect directly by defining new objective functions, as in
Robust Hierarchical Clustering [4,11]. However, these variants do not allow
any overlaps in the clusters. Other variants tried to allow this fuzzy clustering
in the hierarchy, such as SOHC [10], a hierarchical clustering based on a spatial
overlapping metric but with a fixed number of clusters, or HCOSM [26], which
uses an overlap similarity measure to merge clusters and then computes a hierarchy
from an already determined set of clusters. Generalizations of the dendrogram to
more complex structures such as Pyramidal Clustering [15] and Weak Hierarchies
[5] were also proposed. We can find examples to prove that our method produces
even more general hierarchical structures that include the weak hierarchies.


Density-Based Clustering: Another important class of work is the density- 
based clustering. Here, clusters are defined as regions in the data that have a 
higher density. The data points in the sparse areas that are required to separate 
clusters are considered as noise or border points. One of the most widely-used 
algorithms of this category is DBSCAN defined by Ester et al. [17]. This method 
connects data points that satisfy a specific density-based criterion: the minimum 
number of other data points within a given radius must be above a predefined 
threshold. The main advantage of this method is that it allows detecting clus- 
ters of arbitrary shapes. More recently improved versions of DBSCAN were 
proposed such as HDBSCAN* [8]. This new variant not only improved notions 
from DBSCAN and OPTICS [3] but also proposed a procedure to extract a 
simplified cluster tree from the reachability relation which allows determining a 
hierarchy of the clusters but again with no overlapping. 


Overlapping Clustering: Fuzzy clustering methods [6] allow certain data
points to belong to multiple clusters with different levels of confidence. In this
way, the boundary of clusters is fuzzy and we can talk about overlaps of these
clusters. Our definition relies on a different notion: a data point either does or does
not belong to a specific cluster, and it might also belong to multiple clusters. While
HDBSCAN is closely related to connected components of certain level sets, the
clusters do not overlap (since an overlap would imply connectivity).


Community Detection in Networks: A number of algorithmic methods have
been proposed to identify communities. The first kind of methods produces a
partition where a vertex can belong to one and only one community. Following
the modularity function of Newman and Girvan [24], numerous quality functions 
have been proposed to evaluate the goodness of a partition with a fundamental 
drawback, the now proved existence of a resolution limit. The second kind of 
methods, such as CLIQUE [2], k-clique [25], DBLC [31] or NMF [30], aims 
at finding sets of vertices that respect an edge density criterion which allows 
overlaps but can lead to incomplete cover of the network. Similarly to HCOSM, 
the method EAGLE [27] builds a dendrogram over the set of predetermined 
clusters, here the maximal cliques of the network so overlaps appear only at 
the leaf level. Coscia et al. [12] have proposed an algorithm to reconstruct a 
hierarchical and overlapping community structure of a network, by hierarchically 
merging local ego neighbourhoods. 


5 Conclusion and Future Work 


We propose an overlapping hierarchical clustering framework. We construct a
quasi-dendrogram hierarchical structure to represent the clusters, which is
not necessarily a tree (of specific shape) but a directed acyclic graph. In
this way, at each level, we represent a set of possibly overlapping clusters. We
experimentally evaluated our method using several datasets and our new
similarity measure, which thereby proved its usefulness. If the clusters present in
the data show no overlaps, the obtained clusters are identical to the clusters we
can compute using agglomerative clustering methods. In case of overlapping and
nested clusters, however, our method results in a richer representation that can
contain relevant information about the structure of the clusters of the underlying
dataset. As future work we plan to identify interesting clusters on the basis
of the concept of stability. Such methods give promising results in the context
of hierarchical density-based clustering [21], but the presence of overlaps in the
clusters requires specific considerations.


Abstract. Studying migration using traditional data has some limitations.
To date, there have been several studies proposing innovative
methodologies to measure migration stocks and flows from social big 
data. Nevertheless, a uniform definition of a migrant is difficult to find 
as it varies from one work to another depending on the purpose of the 
study and nature of the dataset used. In this work, a generic method- 
ology is developed to identify migrants within the Twitter population. 
This describes a migrant as a person who has the current residence dif- 
ferent from the nationality. The residence is defined as the location where 
a user spends most of his/her time in a certain year. The nationality is 
inferred from linguistic and social connections to a migrant’s country of 
origin. This methodology is validated first with an internal gold standard 
dataset and second with two official statistics, and shows strong perfor- 
mance scores and correlation coefficients. Our method has the advantage 
that it can identify both immigrants and emigrants, regardless of the ori- 
gin/destination countries. The new methodology can be used to study 
various aspects of migration, including opinions, integration, attachment, 
stocks and flows, motivations for migration, etc. Here, we exemplify how 
trending topics across and throughout different migrant communities can
be observed.


1 Introduction


Understanding where migrants are is an important topic because it touches upon
multidimensional aspects of society in both sending and receiving countries. Not
only the demographic fabric of countries but also labour market conditions
and economic conditions may change due to demographic adjustment.
Understanding the distribution of migrants is essential for both policy makers and researchers
to make the most of its effects.

Official data such as census, survey and administrative data have been tradi- 
tionally the main data source to study migration. However, these data have some 
limitations [12]. They are inconsistent across different nations because countries 
employ different definitions of a migrant. Moreover, collecting traditional data 
is costly and time consuming, thus tracking instantaneous stocks of migrants 
becomes difficult. This becomes even harder when tracking emigrants because 
of the lack of motivation from citizens to declare their departure. 

In recent years, however, we are provided with other alternative data sources 
for migration. The availability of social big data allows us to study social 
behaviours both at large scale and at a granular level, and to peek into real- 
world phenomena. Although known to suffer from other types of issues, such as 
selection bias, these data could bring complementary value to standard statistics. 

Here, we propose a method to identify migrants based on Twitter data, to
be used in further analyses. According to the official definition, a migrant¹ is “a
person who moves to a country other than that of his or her usual residence for
a period of at least a year”. In the context of Twitter, we define a migrant as “a
person who has the current residence different from the nationality”.

Following this definition, we performed a two-step analysis. First, we estimated
the current residence for users by examining location information from
tweets. The residence is defined as the country where the user spends most of 
the time in a year. Second, we estimated nationality, by considering the social 
network of users. In the international literature, nationality is defined as a rela- 
tionship between a state and an individual, with rights and duties on both sides 
[1,6]. Related concepts are ethnicity - in terms of cultural features - and citizen- 
ship - in terms of political life. In this paper, we employ the term nationality 
to define the ensemble of features that make a person feel like they belong to a 
certain country [2,5]. This could be the country where a person was born, raised 
and/or lived most of their lives. By comparing labels of residence and nationality 
of a user, we were able to understand whether the person has moved from their 
home country to a host country, and thus if they are a migrant. We validated 
our estimation internally, from the data itself, and externally, with two official 
datasets (Italian register and Eurostat data). 
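The comparison of the two labels can be illustrated with a short sketch. It is deliberately simplified and rests on assumptions that go beyond the text above: time spent in a country is approximated by the number of distinct days with a geo-located tweet there, nationality is taken as a given input rather than inferred, and the names `residence` and `is_migrant` as well as the toy records are hypothetical. The actual identification strategy is described in Sect. 4.

```python
from collections import Counter, defaultdict
from datetime import date


def residence(tweets, year):
    """Country with the most distinct tweeting days per user in `year`.

    `tweets` is an iterable of (user, day, country) records; records without
    a country label are skipped.
    """
    days = defaultdict(set)
    for user, day, country in tweets:
        if day.year == year and country is not None:
            days[user].add((day, country))
    return {
        user: Counter(c for _, c in seen).most_common(1)[0][0]
        for user, seen in days.items()
    }


def is_migrant(residence_country, nationality):
    """A user is labelled a migrant when residence and nationality differ."""
    return residence_country != nationality


tweets = [
    ("u1", date(2018, 1, 1), "IT"),
    ("u1", date(2018, 1, 1), "IT"),   # same day, counted once
    ("u1", date(2018, 1, 2), "IT"),
    ("u1", date(2018, 3, 5), "FR"),   # short stay abroad
    ("u2", date(2018, 2, 1), "DE"),
    ("u2", date(2017, 2, 1), "ES"),   # other year, ignored
    ("u2", date(2017, 2, 2), "ES"),
]
homes = residence(tweets, 2018)
```

Counting days rather than tweets keeps a burst of activity during a short trip from outweighing a long stay, which matches the intent of excluding visitors.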

One of the advantages of our methodology is that it is generic enough to
allow for identification of both immigrants and emigrants. We also overcome
one of the limitations of traditional data by setting up a uniform definition of
a migrant across different countries. Furthermore, our definition of a migrant
is very close to the official definition. We establish the fact that a person has
spent a significant period at the current location. Also, we eliminate visitors
or short-term stays that do not follow the definition of a migrant. This is also
validated by the comparison with official datasets. Another advantage of our
method is the fact that it uses only very basic features from the Twitter data:
location, language and network information. This is useful since the settings of
the freely available Twitter API change constantly. Some of the user attributes
that the existing literature uses to estimate nationality are no longer available.
In addition, we make use of unknown locations of tweets by examining whether
they intersect with identified locations. By doing so, we do not neglect any
information provided by the tweets from unknown locations, which later provide
useful information on trending topics of Italian emigrants overseas.

¹ Recommendations on Statistics of International Migration, Revision 1 (p. 113).
United Nations, 1998.

One of the issues with our method is that the migrants that we observed are 
selected from the Twitter population, and not from the general world population, 
and it is known that some demographic groups are missing. Nevertheless, we 
believe that studying the Twitter migrant population can provide important 
insight into migration phenomena, even if some findings may not apply to the 
other demographic groups that are not represented in the data. 

It is important to note that tracking individual migrants is not the objective 
of our study, but it is only an intermediate stage to enable further analyses. 
We simply perform user classification to identify migrants among users in our 
data, and then aggregate the findings. Further studies we envision are aimed at 
devising new population-level indices useful to evaluate and improve the quality 
of life of migrants, through targeted evidence-based policy making. No individ- 
ual personal information nor migration status is released at any stage during 
the current analysis, nor in any population-level analysis, which is performed 
following the highest ethical and privacy standards. 

The rest of the paper is organised as follows. In the next section we describe
related work that studies migration using big data. In Sect. 3, we provide details
of the experimental setting for data collection as well as data pre-processing.
We then explain our identification strategy for both residence and nationality in
Sect. 4. In Sect. 5, we evaluate our estimation using both internal and external
data. Section 6 covers a possible application of our method on studying trending
topics among Italian emigrants, while Sect. 7 concludes the paper.


2 Related Work 


In the past few years, there have been several works on migration studies using 
social big data. Most of these employed Twitter data but Facebook, Skype, Email 
as well as Call Detail Record (CDR) data have also been used to study both 
international and internal migration [3,9,10,14,16]. Here, we focus on studies 
that have employed freely available data. The definition of a migrant varied from 
one work to another depending on the purpose of the study and the nature of 
the dataset. Thus, the definitions provided fit under different types of migration 
such as refugees, internal migrants, seasonal migrants or even visitors. 
One example of using Twitter to observe migration flows is [15]. The authors defined 
residence as the country from which tweets were most frequently sent over a 
period of four months. If a user’s residence changed in the following four-month 
period, the user was considered to have moved. In a more recent work, 
[11] measured migration flows from Venezuela to neighbouring countries between 
2015 and 2019. They used the bounding boxes and country labels provided with 
the tweets to identify each user’s most common tweeting country per month. 
Their definition of a migrant was “any individual leaving Venezuela during the 
time window of observation”, which was observed when an identified Venezuelan 
resident appeared for the first time in a different country. Our definition of 
residence is somewhat similar to these works. However, unlike them, we 
measure stocks of migrants, not flows. Thus, we take into account the 
duration of stay, which naturally eliminates short-term trips and visits. 

Apart from geo-tagged tweets, there is other information provided by the 
Twitter API that can help us infer whether a person is a migrant or not. 
Although [8] did not study migrants directly but rather foreigners present 
in Qatar, it provides important insights into which of the features provided by 
Twitter are useful in identifying the nationality of users. The authors gathered 
features from both the profiles and the tweets of users. For features derived from 
profile pictures and names, they performed facial recognition and name ethnicity 
detection. Their final results showed that the six most useful features are 
ethnicity of name, race, language of tweets, language of mentions, location of 
followers and location of friends. In this paper, we purely employ data provided 
by Twitter for the analysis and therefore, we do not have name, ethnicity and race 
features. Nevertheless, our work also shows that the locations of users and of their 
friends are useful features. The difference here is that we propose to use the social 
network of users as one of the main features in identifying nationality, which is more 
flexible than having to perform ethnicity detection on names and profile pictures. 


3 Experimental Setting for Data Collection 


We began with a Twitter dataset collected by the SoBigData.eu Laboratory [4]. 
We started from a three-month period of geo-tagged tweets, from August to 
October 2015. Due to our focus on Italy, we selected from these data the users 
that tweeted from Italy, thus obtaining 34,160 users. We then crawled the network 
of geo-enabled friends of these 34,160 users, using the Twitter API. Friends 
are people that the individual users are following. We focused on friends because 
we believe that for a user, the information on whom they follow is more infor- 
mative when it comes to nationality, than who they are followed by. We concen- 
trated on geo-enabled friends because geo-location is necessary for our analysis. 
By collecting friends, the list of users crossed our initial geographic boundary, 
i.e., Italy. At this stage, the number of unique users grew to over 250,000. For 
all users we also scraped the profile information and the 200 most recent tweets 
using the Twitter API. During this process, we were able to collect all 200 recent 
tweets for 97% of users and at least 55 tweets for 99% of users. Our final user 
network consists of 258,455 nodes and 1,205,133 edges, and includes both our 
initial 34,160 users and their geo-tagged friends. 

For the process of identifying migration status, we focus on the core users, i.e., 
34,160 users. We assign a residence and a nationality to each user, based on the 
geo-locations included in the data, the language of tweets and profile information. 
The final dataset includes 237 unique countries from where individuals have sent 
out their tweets, including ‘undefined’ location. Even if a user enables geo-tags 
on their tweets, not all tweets are geo-tagged. As a result, 21% of our tweets are 
‘undefined’. As for the languages, there are 66 unique languages and 12% of our 
tweets are in English. 


[Figure 1 panels: number of days observed in the dataset in 2018 (left); number of tweets in 2018 (right)] 


Fig. 1. Distribution of the number of days (left) and the number of tweets (right) 
observed in the data per user: on average, our users tweeted on 47 days and posted 
82 tweets in 2018. 


As for the profile features, we observe that 40% of the users have filled out 
the location description. In addition, most users have set their profile language 
to English. The number of unique profile languages detected in our data is 58, 
which is smaller than the number of languages used, indicating that some users 
tweet in languages different from their profile language. 

In order to assign a place of residence to users, we needed to restrict the 
observation period. We chose to look at one year of tweets, from 2018, in order 
to assign the residence label for the 2018 calendar year. We selected 
users that tweeted in 2018, identifying 128,305 users. To remove bots, we 
checked whether a user tweeted too many times a day. We considered more than 
50 tweets per day on average to be excessive, and eliminated 39 users in this 
way. In addition, we removed users that were not very active in 2018. If a user 
had fewer than 20 tweets, we checked whether the days on which they tweeted 
were spread out over the year. If the days were not well spread out, we filtered 
out the user. If they were well spread out, the user was tweeting regularly and 
was kept. During this process, we removed 10,764 users. After removing bots and 
inactive users, we have 117,502 users. For these, we show the distribution of the 
number of tweets and of the number of days on which they tweeted in Fig. 1. On 
average, users tweeted on 47 days and posted 82 tweets. 
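The two filters above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how "well spread out" was measured, so the `min_months` criterion (number of distinct months with activity) is a hypothetical stand-in, and the daily average is assumed to be taken over the days on which the user tweeted.

```python
from collections import Counter
from datetime import date


def keep_user(tweet_dates, max_daily_avg=50, min_tweets=20, min_months=6):
    """Return True if a user passes the bot and activity filters.

    tweet_dates: list of datetime.date, one entry per tweet in the year.
    min_months is a hypothetical proxy for "well spread out".
    """
    if not tweet_dates:
        return False
    per_day = Counter(tweet_dates)
    # Bot filter: average number of tweets per active day above the threshold.
    if len(tweet_dates) / len(per_day) > max_daily_avg:
        return False
    # Activity filter: sparse users are kept only if their activity
    # covers enough distinct months (assumed spread criterion).
    if len(tweet_dates) < min_tweets:
        months = {d.month for d in tweet_dates}
        return len(months) >= min_months
    return True
```

A user with twelve tweets, one per month, is kept under these assumptions, whereas a user whose ten tweets all fall in the same month is dropped.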
In addition to the Twitter data, we also collected a list of official and spoken 
languages for countries identified in our data¹. 


4 Identifying Migrants 


A migrant is a person whose residence differs from their nationality. We 
thus consider our core 34,160 Twitter users and assign a residence and nationality 
based on the information included in our dataset. The difference between the two 
labels will allow us to detect individuals who have migrated and are currently 
living in a place different from their home country. The methodology we propose 
is based on a series of hypotheses: a person that has moved away from their 
home country stays in contact with their friends back in the home country and 
may keep using their mother tongue. 


4.1 Assigning Residence 


In order for a place to be called residence, a person has to spend a considerable 
amount of time at the location. Our definition of residence is based on the amount 
of time in which a Twitter user is observed in a country for a given calendar year. 
More precisely, the residence of each user is the country with the longest length of 
stay, which is calculated by taking into account both the number of days on which 
a user tweets from a country and the period between consecutive tweets in 
the same country. In this work we compute residences based on 2018 data. 

To compute the residence, we first count, for each user, the number of days on 
which we see tweets from each country. If the top location is not ‘undefined’, 
then that location is chosen as the residence. Otherwise, we check whether any 
tweet from an ‘undefined’ location was sent on the same day as a tweet 
from the second top country. If at least one date matches between the 
two locations, we assign the second country as the user’s place of residence. On 
average, 5 dates matched. This is done under the assumption that a user cannot 
tweet from two different countries on the same day. Although this does not always 
hold when a user travels, it should be true on most days of the year. This approach 
allowed us to assign a residence in 2018 to 57,180 users. 
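The day-count rule can be illustrated with a short sketch. The input format (pairs of date and country) and the alphabetical tie-breaking are assumptions made for the example and are not specified in the text.

```python
from collections import defaultdict


def residence_by_days(tweets):
    """tweets: iterable of (date, country) pairs; country may be 'undefined'.

    Returns the residence country, or None if this rule cannot decide.
    """
    days = defaultdict(set)
    for day, country in tweets:
        days[country].add(day)
    if not days:
        return None
    # Rank countries by the number of distinct days with at least one tweet
    # (ties broken alphabetically, an assumption made for determinism).
    ranked = sorted(days, key=lambda c: (-len(days[c]), c))
    top = ranked[0]
    if top != "undefined":
        return top
    if len(ranked) > 1:
        second = ranked[1]
        # Fall back to the runner-up if it shares a date with 'undefined'.
        if days["undefined"] & days[second]:
            return second
    return None
```

Users for whom the function returns None correspond to those handled by the second approach described next.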

For the remaining 60,322 users, a slightly different approach was imple- 
mented. We computed the length of stay in days by adding together the duration 
between consecutive tweets in the same country. We selected the country with 
the largest length of stay. In case the top country was ‘undefined’, we checked 
whether ‘undefined’ locations were in between segments of the second top coun- 
try, in which case the second country was chosen. In this way, an additional 
11,046 users were assigned a place of residence. The remaining 49,276 users were 
discarded because we considered that we did not have enough information to 
assign a residence. 
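A possible reading of the length-of-stay rule is sketched below. The interpretation of "in between segments of the second top country" (an ‘undefined’ tweet whose nearest located tweets before and after both come from the runner-up country) is an assumption, as are the function names and the input format.

```python
from collections import defaultdict
from datetime import date


def length_of_stay(tweets):
    """tweets: list of (datetime.date, country). Returns {country: days}."""
    stay = defaultdict(int)
    ordered = sorted(tweets)
    # Add the gap between consecutive tweets sent from the same country.
    for (d0, c0), (d1, c1) in zip(ordered, ordered[1:]):
        if c0 == c1:
            stay[c0] += (d1 - d0).days
    return dict(stay)


def residence_by_stay(tweets):
    """Country with the largest length of stay, or None if undecidable."""
    stay = length_of_stay(tweets)
    if not stay:
        return None
    ranked = sorted(stay, key=lambda c: (-stay[c], c))
    if ranked[0] != "undefined":
        return ranked[0]
    if len(ranked) < 2:
        return None
    second = ranked[1]
    # Assumed reading: some 'undefined' tweet is surrounded, among located
    # tweets, by the runner-up country on both sides.
    countries = [c for _, c in sorted(tweets)]
    for i, c in enumerate(countries):
        if c != "undefined":
            continue
        before = next((x for x in reversed(countries[:i]) if x != "undefined"), None)
        after = next((x for x in countries[i + 1:] if x != "undefined"), None)
        if before == second and after == second:
            return second
    return None
```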


¹ Retrieved from http://www.geonames.org and https://www.worlddata.info. 
4.2 Assigning Nationality 


In order to estimate nationalities for Twitter users, we took into account two 
types of information included in our Twitter data. The first type relates to the 
users themselves, and includes the countries from which tweets are sent and the 
languages in which users tweet. For each user u we define two dictionaries loc" 
and lang” where we include, for each country and language the proportion of 
user tweets in that country/language. 


[Figure 2 content: a fictitious user $u_1$ and their three friends] 
Friend 1: loc = {France: 0.2, Italy: 0.8}, lang = {Italian: 1} 
Friend 2: loc = {Italy: 0.1, Korea: 0.9}, lang = {Korean: 1} 
Friend 3: loc = {Korea: 1}, lang = {Korean: 1} 
User $u_1$: loc = {France: 0.1, Italy: 0.8, Korea: 0.1}, lang = {French: 0.2, Italian: 0.1, Korean: 0.7} 
User $u_1$: floc = {France: 0.066, Italy: 0.3, Korea: 0.633}, flang = {Italian: 0.33, Korean: 0.66} 


Fig. 2. Example of the calculation of the floc and flang values for a user. The calculation 
of $\mathrm{floc}^{u_1}$ and $\mathrm{flang}^{u_1}$ is based on the loc and lang values of the three friends, 
which show the distribution of each friend’s tweets over countries/languages. 


The second type of information used is related to the user’s friends. Again, 
we look at the languages spoken by friends, and locations from which friends 
tweet. Specifically, starting from the loc and lang dictionaries of all friends of 
a user, we define two further dictionaries floc and flang. The first stores all 
countries from where friends tweet, together with the average fraction of tweets 
in that country, computed over all friends: 


$$\mathrm{floc}^u[C] = \frac{1}{|F(u)|} \sum_{f \in F(u)} \mathrm{loc}^f[C] \qquad (1)$$


where F(u) is the set of friends of user u. Similarly, the flang dictionary stores all 
languages spoken by friends, with the average fraction of tweets in each language: 

$$\mathrm{flang}^u[l] = \frac{1}{|F(u)|} \sum_{f \in F(u)} \mathrm{lang}^f[l] \qquad (2)$$

Figure 2 shows an example of a (fictitious) user with their friends, and the four 
resulting dictionaries. 
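As a check on Eqs. 1 and 2, the averaging can be written in a few lines and applied to the three friends of Fig. 2. The function name `average_over_friends` is illustrative only; a country or language missing from a friend's dictionary is assumed to count as 0 for that friend.

```python
from collections import defaultdict


def average_over_friends(friend_dicts):
    """Average a list of {key: fraction} dictionaries over all friends.

    Keys missing from a friend's dictionary count as 0 for that friend.
    """
    if not friend_dicts:
        return {}
    total = defaultdict(float)
    for d in friend_dicts:
        for key, value in d.items():
            total[key] += value
    return {key: value / len(friend_dicts) for key, value in total.items()}


# The three friends of the fictitious user in Fig. 2.
friend_loc = [{"France": 0.2, "Italy": 0.8}, {"Italy": 0.1, "Korea": 0.9}, {"Korea": 1.0}]
friend_lang = [{"Italian": 1.0}, {"Korean": 1.0}, {"Korean": 1.0}]

floc = average_over_friends(friend_loc)
flang = average_over_friends(friend_lang)
```

The resulting values (France 0.2/3, Italy 0.3, Korea 1.9/3 for floc; Italian 1/3, Korean 2/3 for flang) agree with those shown in Fig. 2 up to rounding.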

The four dictionaries defined above are then used to assign a nationality score 
to each country C for each user u: 


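$$
\begin{aligned}
N^u_C ={}& w_{loc}\, \mathrm{loc}^u[C] + w_{lang} \sum_{l \in \mathrm{languages}(C)} \mathrm{lang}^u[l] \; + && (3) \\
& w_{floc}\, \mathrm{floc}^u[C] + w_{flang} \sum_{l \in \mathrm{languages}(C)} \mathrm{flang}^u[l] && (4)
\end{aligned}
$$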




where $\mathrm{languages}(C)$ is the set of languages spoken in country $C$, while $w_{loc}$, 
$w_{lang}$, $w_{floc}$ and $w_{flang}$ are parameters of our model which need to be estimated 
from the data (one global value estimated for all users). Each $w$ value gives 
a weight to the corresponding user attribute in the calculation of the nationality. 
To select the nationality of each user, we simply select the country $C$ with 
maximum $N^u_C$: $N^u = \arg\max_C N^u_C$. 
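The score of Eqs. 3 and 4 and the final arg max can be sketched as below. The function signature and the `languages` mapping (country to set of spoken languages) are illustrative assumptions, not the implementation used in the study.

```python
def nationality(loc, lang, floc, flang, languages, w_loc, w_lang, w_floc, w_flang):
    """Return (best_country, scores) following Eqs. 3 and 4.

    languages: {country: set of languages spoken there}; must be non-empty.
    """
    scores = {}
    for country, spoken in languages.items():
        scores[country] = (
            w_loc * loc.get(country, 0.0)
            + w_lang * sum(lang.get(l, 0.0) for l in spoken)
            + w_floc * floc.get(country, 0.0)
            + w_flang * sum(flang.get(l, 0.0) for l in spoken)
        )
    # The nationality is the country with the maximum score.
    best = max(scores, key=scores.get)
    return best, scores
```

For the fictitious user of Fig. 2, assuming one language per country and the weights reported in Sect. 5.1 (2 for own location, 1.5 for friends' location, 0 for both language terms), the scores are 2.05 for Italy, 1.15 for Korea and 0.3 for France, so the user is labelled Italian.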


5 Evaluation 


To evaluate our strategy for identifying migrants we first propose an internal val- 
idation procedure. This defines gold standard datasets for residence and nation- 
ality and computes the classification performance of our two strategies to identify 
the two user attributes. The gold standard datasets are produced using profile 
information as they are provided by the users themselves. We then perform an 
external validation where we compare the migrant percentages obtained in our 
data with those from official statistics. 


5.1 Internal Validation: Gold Standards Derived from Our Data 


Residence. To devise a gold standard dataset for residence we consider profile 
locations set by users. We assume that if users declare a location in their profile, 
then that is most probably their residence. Very few users actually declare a 
location, and not all of them provide a valid one, thus we only selected profile 
locations that were identifiable to country level. Among the user accounts for 
which we could estimate the residence, 3,065 accounts had a valid country in 
their profile location. Using these accounts as our validation data, we computed 
the F1 score to measure the performance of our residence calculation. Table 1 
shows overall results, and also scores for the most common countries individually. 
The weighted average of the F1 score is 86%, with individual countries reaching 
up to 94%, demonstrating the validity of our residence estimation procedure. 


Nationality. In order to build a gold standard for nationality, we take into account 
the profile language declared by the users. The assumption is that profile languages 
can provide a hint of one’s nationality [13]. However, many users might not set their 
profile language, but use the default English setting. For this reason, we do not 
include into the gold standard users that have English as their profile language. 


Table 1. Average precision, recall and F1 scores, together with scores for the top 7 
residences in terms of support size. 


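|           | Weighted avg | Macro avg | Micro avg | IT    | KW    | US    | ID    | SG    | AU    |
|-----------|--------------|-----------|-----------|-------|-------|-------|-------|-------|-------|
| F1-score  | 0.858        | 0.716     | 0.856     | 0.928 | 0.839 | 0.703 | 0.945 | 0.83  | 0.891 |
| Precision | 0.879        | 0.745     | 0.856     | 0.935 | 0.989 | 0.572 | 0.949 | 0.946 | 0.883 |
| Recall    | 0.856        | 0.727     | 0.856     | 0.921 | 0.728 | 0.91  | 0.941 | 0.739 | 0.899 |
| Support   | 3065         | 3065      | 3065      | 343   | 125   | 122   | 119   | 119   | 109   |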
Table 2. Average precision, recall and F1 scores, together with scores for the top 8 
nationalities in terms of support size. 


|           | Weighted avg | Macro avg | Micro avg | IT    | ES   | TR   | RU   | FR   | BR   | DE   | AR   |
|-----------|--------------|-----------|-----------|-------|------|------|------|------|------|------|------|
| F1-score  | 0.99         | 0.98      | 0.72      | 0.99  | 0.96 | 0.98 | 0.95 | 0.94 | 0.95 | 0.92 | 0.97 |
| Precision | 0.99         | 0.98      | 0.73      | 1     | 0.94 | 0.98 | 0.98 | 0.9  | 0.96 | 0.91 | 0.98 |
| Recall    | 0.98         | 0.98      | 0.75      | 0.99  | 0.97 | 0.99 | 0.93 | 0.98 | 0.94 | 0.93 | 0.95 |
| Support   | 12223        | 12223     | 12223     | 10781 | 302  | 173  | 146  | 118  | 113  | 86   | 59   |


The profile language, however, does not immediately translate into national- 
ity. While for some languages the correspondence to a country is immediate, for 
many others it is not. For instance, Spanish is spoken in Spain and most Amer- 
ican countries, so one needs to select the correct one. For this, we look at tweet 
locations. We consider all countries that match with the profile language and, 
among these, we select the one with the largest number of tweets, but only if the 
number of tweets from that country is at least 10% of the total number of tweets 
of that user. This allows us to select the most probable country, even for users who 
reside outside their native country. If no location satisfies this criterion the user 
is not included in the gold standard. We were able to identify nationalities of 
12,223 users. Due to the fact that during data collection we focused on geo-tags 
in Italy, the dataset contains a significant number of Italians. 
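The gold-standard rule for nationality can be sketched as follows. The function name, the input format and the treatment of English as the only excluded default language are assumptions based on the description above.

```python
def gold_nationality(profile_lang, tweet_counts, languages, min_share=0.10):
    """Gold-standard nationality from profile language and tweet locations.

    tweet_counts: {country: number of tweets by the user from that country}.
    languages: {country: set of languages spoken there}.
    Returns a country, or None if the user is left out of the gold standard.
    """
    if profile_lang == "English":
        return None  # default setting, excluded from the gold standard
    total = sum(tweet_counts.values())
    if total == 0:
        return None
    # Countries that match the profile language and have at least one tweet.
    candidates = [c for c, spoken in languages.items()
                  if profile_lang in spoken and tweet_counts.get(c, 0) > 0]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: tweet_counts[c])
    # Accept only if the country holds at least 10% of the user's tweets.
    return best if tweet_counts[best] / total >= min_share else None
```

For a user with a Spanish profile and 150 tweets from Italy, 40 from Argentina and 10 from Spain, this sketch returns Argentina, since it is the Spanish-speaking country with the most tweets and holds 20% of the total.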
Fig. 3. Distribution of residences and nationalities for the top 30 countries, for all users 
that possess both residence and nationality labels. 


We employed this gold standard dataset in two ways. First, we needed to 
select suitable values for the $w$ weights from Eqs. 3 and 4. These show the importance 
of the four components used for nationality computation: own language and 
location, friends’ language and location. We performed a simple grid search and 
obtained the best accuracy on the gold standard using values of 0 for both language 
weights, and 2 and 1.5 for own and friends’ location, respectively. Thus we can conclude 
that locations are the most important attributes in defining nationality for Twitter 
users, with a slightly stronger weight on the individual’s location than on that of their 
friends. The final F1-scores, both overall and for the top individual nationalities, are 
included in Table 2, showing very good performance in all cases. 
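A grid search of this kind can be sketched as below. The candidate grid, the user record layout and the alphabetical tie-breaking are assumptions; the text only states that a simple grid search was used to maximise accuracy on the gold standard.

```python
from itertools import product


def predict(user, weights, languages):
    """Predicted nationality of one user for a given weight tuple."""
    w_loc, w_lang, w_floc, w_flang = weights

    def score(c):
        spoken = languages[c]
        return (w_loc * user["loc"].get(c, 0.0)
                + w_lang * sum(user["lang"].get(l, 0.0) for l in spoken)
                + w_floc * user["floc"].get(c, 0.0)
                + w_flang * sum(user["flang"].get(l, 0.0) for l in spoken))

    # sorted() makes ties deterministic (alphabetical, an assumption).
    return max(sorted(languages), key=score)


def grid_search(users, gold, languages, grid=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """Try every weight combination; return (best_weights, best_accuracy)."""
    best_weights, best_acc = None, -1.0
    for weights in product(grid, repeat=4):
        hits = sum(predict(u, weights, languages) == g for u, g in zip(users, gold))
        acc = hits / len(users)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc
```

With five candidate values per weight the search evaluates 625 combinations, which is inexpensive because the four dictionaries of each user are computed once beforehand.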
To assign final residences and nationalities to our core users, we combined the 
predictions with the gold standards (we predicted only if the gold standard was 
not present). Figure 3 shows the final distribution of residences and nationalities 
for the top 30 countries, for all users that have both the residence and nationality 
labels. The difference between the residence and nationality counts can be 
interpreted as either immigrants or emigrants. 
Fig. 4. Comparison between the true and predicted data; the first two plots show 
predicted versus AIRE/EUROSTAT data on European countries. The last plot shows 
predicted versus AIRE data on non-European countries. 


5.2 External Validations: Validation with Ground Truth Data 


In order to validate our results with ground truth data, we study users labelled 
with Italian nationality and non-Italian residence, i.e. Italian emigrants. We com- 
puted the normalised percentage of Italian emigrants resulting from our data for 
all countries, and compared with two official datasets: AIRE (Anagrafe Italiani 
residenti all’estero), containing Italian register data, and Eurostat, the European 
Union statistical office. For comparison we use Spearman correlation coefficients, 
which allow for quantifying the monotonic relationship between the ground truth 
data and our estimation by taking ranks of variables into consideration. 
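For reference, the Spearman coefficient is the Pearson correlation of the rank vectors. The sketch below is a plain-Python illustration with average ranks for ties; it is not the code used for the reported values, and it assumes neither input is constant.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the computation, any strictly increasing relationship between estimated and official emigrant percentages yields a coefficient of 1, even if the relationship is far from linear.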

Figure 4 displays the various values obtained, compared with official data. 
A first interesting remark is that even between the official datasets themselves, 
the numbers do not match completely. The correlation between the two datasets 
is 0.91. Secondly, we observed good agreement between our predictions and the 
official data for European countries. The correlation with AIRE is 0.753, while 
with Eurostat it is 0.711 when considering Europe. For non-European countries, 
however, the correlation with AIRE data drops to 0.626. We believe the lower 
performance is due to several factors related to sampling bias and data quality 
in the various datasets. This includes bias on Twitter and in our methods, but 
also errors in the official data, which could be larger in non-EU countries due to 
less efficient connections in sharing information. 

All in all, we believe our method shows good performance and can be suc- 
cessfully used to build population level indices for studying migration. We do 
not aim to perform nowcasting of immigrant stocks, but rather to identify a 
population that can be representative enough for further analyses. 
6 Case Study: Topics on Twitter 


In this section we show that our methodology can be employed to study how 
trending topics in Italy are also being discussed among Italian emigrants. As 
an example, we selected one hashtag that has been very popular in recent 
years: #Salvini. This refers to the Italian politician Matteo Salvini, who served as 
Deputy Prime Minister and Minister of the Interior in Italy until recently. To 
this, we added the top nine hashtags that appear most frequently with #Salvini in our 
data: Berlusconi, Conte, Diciotti, DiMaio, Facciamorete, Lega, M5S, Migranti, 
Ottoemezzo. These refer to people who are often mentioned together with Salvini, 
to political parties, or to other issues associated with the hashtag #Salvini. 
Fig. 5. Stream graph: appearance of hashtags related to #Salvini in tweets by Italians across 
10 selected residence countries in 2018. The discussion was present in Italy 
throughout the year and was taken up more actively by Italians overseas as Salvini 
gained more political attention. 


Figure 5 shows the evolution of the usage of the 10 above-mentioned hashtags 
across different Italian communities, both within and outside Italy. The values 
shown are the number of tweets from Italian nationals residing in each country 
that include one of the 10 hashtags, divided by the total number of tweets from 
Italian nationals from that country. Values are computed monthly. Thus, we 
show the monthly popularity of the topics in each country. In this way, even 
tweets from less represented countries remain visible. As the figure shows, 
the hashtags were used continuously by Italians in Italy. We observed that they 
gradually spread to other residence countries as Salvini received more 
and more attention. We also observe that most of the attention comes from 
Italians residing in Europe, with non-European countries less represented. 
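The normalised monthly values can be computed along the following lines. This sketch assumes the denominator is also taken per month, and it assumes case-insensitive hashtag matching; neither detail is stated explicitly in the text.

```python
from collections import defaultdict


def monthly_share(tweets, hashtags):
    """Share of tweets containing a tracked hashtag, per month and country.

    tweets: iterable of (month, residence_country, set_of_hashtags) for
    users labelled as Italian nationals.
    Returns {(month, country): share}.
    """
    tracked = {h.lower() for h in hashtags}
    total = defaultdict(int)
    hits = defaultdict(int)
    for month, country, tags in tweets:
        total[(month, country)] += 1
        # A tweet counts once, however many tracked hashtags it contains.
        if tracked & {t.lower() for t in tags}:
            hits[(month, country)] += 1
    return {key: hits[key] / total[key] for key in total}
```

Dividing by the number of tweets from the same country is what keeps small communities visible: a country with few Italian users can still show a high share.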


7 Conclusion and Future Work 


We have developed a new methodology to provide a snapshot of migrants within 
the Twitter population. We considered the length of stay in a country as the 
key factor to define a user’s residence. As for the nationality, connections which 
migrants maintain with their country of origin provided us with a good indica- 
tion. In particular, the location of friends seemed to be a strong feature in deter- 
mining nationality, together with the location of the users themselves. Tweet 
language, on the other hand, was not considered relevant by our model. This 
is probably due to the fact that English is the dominating language on Twit- 
ter, since a language that is widely understood has to be spoken to get more 
attention from other users. We have validated our results both with internal and 
external data. The results show good classification performance scores and good 
correlation coefficients with official datasets. 

The constructed dataset can be applied in different scenarios. We have shown 
how it can be used to study trending topics on Twitter, and how attention is 
divided between emigrants and non-migrants of a certain nationality. In the 
future, we plan to analyse social ties, integration and assimilation of migrants 
[7]. At the same time, one can investigate the strength of the ties with the 
community of origin. 
Abstract. The ability to detect an unusual concentration of extreme 
observations in a connected region of a graph is fundamental in a number 
of use cases, ranging from traffic accident detection in road networks to 
intrusion detection in computer networks. This task is usually performed 
using scan statistics-based methods, which require explicitly finding the 
most anomalous subgraph and thus are computationally intensive. 

We propose a more scalable method in the case where the observa- 
tions are assigned to the edges of a large-scale network. The rationale 
behind our work is that if an anomalous cluster exists in the graph, then 
the subgraph induced by the most individually anomalous edges should 
contain an unexpectedly large connected component. We therefore refor- 
mulate our problem as the detection of anomalous sample paths of a 
percolation process on the graph, and our contribution can be seen as a 
generalization of previous work on percolation-based cluster detection. 
We evaluate our method through extensive simulations. 


1 Introduction 


Detection of a significant connected subgraph in a larger background network is 
a ubiquitous task: such significant regions can be indicative of fraudulent behav- 
ior in social networks [15] or of the propagation of an intruder in a computer 
network [22], for instance. Therefore, being able to discern them from ambient 
noise has valuable applications in a number of settings. This anomaly detection 
problem is, however, remarkably challenging: the large size and complex struc- 
ture of real-world graphs make the characterization of normal behavior difficult 
and the search for non-trivial substructures computationally expensive. 

The aim of this paper is to propose a scalable method for anomalous con- 
nected subgraph detection in a graph with observations attached to its edges. The 
null distribution of the observations, or an approximation thereof, is assumed 
to be known. Building upon this knowledge, the degree of abnormality of each 
individual edge with respect to the model can be measured, and our goal is to 
detect a significant concentration of anomalous edges in a connected region of 
the graph. Usual methods for this task are built around scan statistics [14]. Such 
methods boil down to maximizing a scoring function over the set of connected 
regions of the graph, then rejecting the null hypothesis (i.e. absence of anoma- 
lous subgraph) if the maximum exceeds a certain threshold. This implies solving 
a combinatorial optimization problem over the class of all connected subgraphs, 
which is expensive due to the exponentially growing size of the latter. 

In contrast, our approach does not require explicitly searching for the best 
candidate subgraph. Instead, we build on the following idea: under the null 
hypothesis, the most individually anomalous edges are randomly spread out 
over the graph. Therefore, removing all but the k most anomalous edges from 
the graph is equivalent to drawing k edges uniformly at random and extracting 
the subgraph induced by these edges. In other words, this procedure amounts to 
bond percolation on a graph. On the other hand, when an anomalous subgraph 
is present, the location of the individual anomalies is no longer random, and 
thus the subgraph induced by the k most anomalous edges should contain an 
unexpectedly large connected component. 
This link between anomalous subgraph detection and percolation theory has 
already been introduced in the context of regular lattices [6,19,20], but to the 
best of our knowledge, it has not yet been studied for arbitrary graphs. 
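The idea above can be illustrated with a short sketch that keeps the k most anomalous edges and measures the largest connected component they induce, using a union-find structure. Measuring component size in vertices and treating larger scores as more anomalous are assumptions of this example; how the observed value is compared with its null behaviour is the subject of Sect. 3 and is not covered here.

```python
def largest_component_size(edges):
    """Number of vertices in the largest connected component of the
    subgraph induced by `edges` (a list of (u, v) pairs)."""
    parent, size = {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        for x in (u, v):
            if x not in parent:
                parent[x], size[x] = x, 1
        ru, rv = find(u), find(v)
        if ru != rv:
            # Union by size: attach the smaller tree to the larger one.
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max((size[find(x)] for x in parent), default=0)


def top_k_statistic(edges, scores, k):
    """Keep the k edges with the highest anomaly scores and return the
    size of the largest connected component they induce."""
    ranked = sorted(range(len(edges)), key=lambda i: scores[i], reverse=True)
    return largest_component_size([edges[i] for i in ranked[:k]])
```

The cost is dominated by sorting the m edge scores, so no search over connected subgraphs is needed, which is the source of the scalability argument.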

We argue that our method is more scalable than traditional ones while 
retaining an acceptable detection power, especially when seeking to detect small 
anomalous regions in large graphs. We assess this detection performance through 
numerical experiments on several realistic synthetic graphs. 

The rest of this paper is structured as follows. In Sect. 2, we introduce the 
statistical framework for our problem and present some related work. Section 3 
describes our detection method, while Sect. 4 is devoted to its empirical evaluation 
on simulated data. Finally, we discuss our results and some interesting leads 
for future work in Sect. 5, then briefly conclude in Sect. 6. 


2 Problem Formulation and Related Work 


We begin with a thorough formulation of our problem as a case of statistical 
hypothesis testing, then review the main existing approaches to it. 


2.1 Problem Formulation — Statistical Hypothesis Testing 


Consider an undirected and connected graph $G = (V, E)$, where $V$ (resp. $E$) is 
the set of vertices (resp. edges) of $G$. Letting $|A|$ denote the number of elements 
of a set $A$, we write $m = |E|$, and we use $E$ and $[m] = \{1, \ldots, m\}$ interchangeably 
to represent the set of edges. We further write $2^A$ for the set of all subsets of $A$ 
and $\mathbb{1}\{\cdot\}$ for the indicator function of an event. 

Let $\mathcal{A} \subset 2^E$ denote the class of subsets of $E$ whose induced subgraph in $G$ 
is connected. Given a signal $X = (X_1, \ldots, X_m) \in \mathbb{R}^m$ observed on the edges 
of $G$ and a known probability distribution $F_0$, the null hypothesis is defined as 

$$H_0 : \forall i \in [m],\ X_i \overset{\text{i.i.d.}}{\sim} F_0.$$

For each $S \in \mathcal{A}$, we further define the alternative 

$$H_S : \begin{cases} X_S \sim F_S, \\ \forall i \notin S,\ X_i \overset{\text{i.i.d.}}{\sim} F_0, \end{cases}$$

where $X_S$ is the restriction of $X$ to $S$ and $F_S$ is a joint probability distribution. 
$F_S$ is only assumed to be different from $F_0^{\otimes |S|}$, and it can differ in various ways. 
In many applications, the observations in S are simply larger than expected 
(consider for instance network intrusion detection, where the presence of an 
intruder results in additional activity in a connected region of the network). The 
problem considered in this paper can be formulated as 


$$H_0 \quad \text{vs.} \quad H = \bigcup_{S \in \mathcal{A}} H_S.$$


That is, we want to know whether there exists a connected subgraph of G 
inside of which the observations $X_i$ are drawn from an alternative distribution. 
Note that we only care about detection, leaving the reconstruction of S aside. 


2.2 Related Work — Scan Statistics and Beyond 


A lot of existing work deals with a specific instance of the problem defined above, 
namely elevated mean detection on a graph. In this setting, the observations are 
independent standard centered normal random variables under the null, while $X_i$ 
has mean $\mu_S \mathbb{1}\{i \in S\}$ under the alternative $H_S$ (for some $\mu_S > 0$). Theoretical 
conditions for detectability in this case are stated in [1]. A closely related problem 
arises when the observations are associated with vertices rather than edges, 
and this setting was studied in [3-5]. However, these papers focus on statistical 
analysis and do not provide computationally tractable tests. 

From a more practical perspective, the most common approach to anomalous 
subgraph detection is based on scan statistics. Broadly speaking, this method 
consists in defining a scoring function $f : 2^E \to \mathbb{R}$, computing the test 
statistic $t = \max_{S \in \mathcal{A}} f(S)$, then rejecting $H_0$ if $t$ exceeds a given threshold. This 
amounts to finding the most anomalous subset $S^*$ in $\mathcal{A}$, and then rejecting the 
null hypothesis if $S^*$ is anomalous enough. Defining $f$ requires some hypotheses 
on the class of alternative distributions $\{F_S\}$. For instance, when $F_S$ has a 
parametric form, $f(S)$ can be defined as the likelihood ratio between $H_S$ and 
$H_0$. In the more general case considered here, however, finding a suitable scoring 
function is non-trivial. Moreover, computing t implies maximizing f over the 
combinatorial class A, which quickly becomes computationally intensive as the 
graph grows. Therefore, most related work focuses on making the computation 
of scan statistics more efficient. Ways to achieve this include the following: 


Restriction of the Class A. The easiest way to speed up the computation is 
to simply reduce the size of the search space by considering only a subset of 
A. Such restriction can be based on domain-specific knowledge [17,18,22,25] 
or more general heuristics [24]. 
Convex Relaxation. Another classical approach to combinatorial optimization 
consists in solving a convex relaxation of the problem, and then projecting 
the solution back onto the original search space. This method was applied to 
scan statistics [2,26,27], using elements of spectral graph theory [9] to find a 
relaxed form of the connectivity constraint. Similar ideas were also used in 
a slightly different context [29-31], where the class A consists of subgraphs 
with low cut size rather than connected ones. 

Algorithmic Approaches. Finally, efficient optimization algorithms have been 
used to find exact or approximate values for the scan statistic, including sim- 
ulated annealing [11,12], greedy algorithms [28], primal-dual algorithms [28], 
branch and bound algorithms [32] and dynamic programming algorithms [33]. 


Despite the popularity of scan statistics, other ideas have also been considered 
in the literature. We focus on one of these alternative approaches, namely the 
Largest Open Cluster (LOC) test, which was first studied in the context of object 
detection in images [19,20]. The idea of this method is to represent an image 
as a two-dimensional lattice, each node carrying a random variable standing 
for the value of the associated pixel. Then, after deleting from the lattice every 
vertex whose pixel value is lower than a suitable threshold, the largest remaining 
connected component is expected to be small if there is no object in the image. 
On the other hand, if an object is present, an unexpectedly large connected 
component should remain in the thresholded lattice. The theory behind the 
LOC test has since been extended to lattices of arbitrary dimension [6], but to 
the best of our knowledge, the underlying idea of using percolation theory to 
detect anomalous connected subgraphs has not yet been applied to complex, 
arbitrary-shaped networks. 


3 Local Anomaly Detection and Percolation Theory 


We now describe our method, first introducing some necessary notions of percola- 
tion theory, then highlighting their relevance to our anomaly detection problem. 
Finally, we provide a detailed description of our testing procedure. 


3.1 Some Notions of Percolation Theory 


An interesting aspect of the LOC test is that the behavior of its test statistic 
under the null hypothesis can be described using percolation theory. Therefore, 
we first review some useful results from this field, which motivate our approach. 
For more details, see for example [10] and references therein. Since our primary 
interest is in signals associated with edges, we focus on bond percolation, where 
edges of a connected graph with n vertices are occupied uniformly at random 
with probability p or unoccupied with probability 1 − p.

Let C(p) denote the size of the largest connected component of this graph 
at occupation probability p. The main focus of percolation theory is to find the 
limit of C(p) as n becomes large. Extremal values of p yield obvious results: for
p = 0, C(p) = 1 for any n, and for p = 1, lim_{n→∞} C(p) = ∞. For intermediate
values of p, however, there are two possible regimes. If p is small enough, only 
small connected components are present and C(p)/n converges in probability 
to 0. On the other hand, larger values of p lead to the emergence of a giant 
connected component, which contains a constant fraction of the vertices. The 
transition between the two regimes happens for a critical value of p called the
percolation threshold p_c. Note that p_c depends on the graph structure and can be
vanishingly small. Although this phase transition is only well-defined in the limit
of an infinite graph, a somewhat similar behavior can be observed in the finite
case [8,16]. In particular, define the percolation process {C(p)}_{0≤p≤1} as follows:
assign to each edge e an independent random variable U_e, uniformly distributed
on [0,1]. Then, keeping the U_e fixed, let p vary on [0,1], deleting e from the
graph whenever U_e > p. A closely related process is obtained by considering
the imbedded Markov chain {G_k}_{k≥0}, where G_k is the subgraph induced by
the edges associated with the k smallest random variables. Letting C_k denote
the size of the largest connected component of G_k, {C_k}_{k≥0} can be seen as a
discretized version of {C(p)}_{0≤p≤1}. Even for finite graphs, sample paths of these
two processes do not deviate significantly from the mean trajectory, making them 
suitable candidates for anomaly detection. 
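
The discretized process can be simulated with a union-find structure, which is the core idea behind fast percolation algorithms. The following is a minimal sketch rather than a reference implementation: the function names are hypothetical, vertices are assumed to be labeled 0, ..., n − 1, and a uniformly random edge permutation stands in for sorting the i.i.d. uniform variables U_e (the two are equivalent in distribution).

```python
import random


def largest_component_path(n_vertices, edges, order):
    """Return [C_0, ..., C_m]: largest component size after adding the first k edges of `order`."""
    parent = list(range(n_vertices))
    size = [1] * n_vertices

    def find(v):
        # Path halving keeps the union-find trees shallow.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    largest = 1 if n_vertices > 0 else 0
    path = [largest]
    for idx in order:
        u, v = edges[idx]
        ru, rv = find(u), find(v)
        if ru != rv:
            # Union by size: attach the smaller tree to the larger one.
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
            largest = max(largest, size[ru])
        path.append(largest)
    return path


def random_percolation_path(n_vertices, edges, rng):
    """One sample path of the chain under a uniformly random edge ordering."""
    order = list(range(len(edges)))
    rng.shuffle(order)
    return largest_component_path(n_vertices, edges, order)
```

For instance, on a path graph with edges (0,1), (1,2), (2,3), adding the edges in the order 0, 2, 1 gives the trajectory 1, 2, 2, 4.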


3.2 Application to Anomalous Subgraph Detection 


We now motivate the idea of mapping a signal X onto a sample path of the
percolation process. For i ∈ [m], define P_i = 1 − F_0(X_i) as the upper tail p-value
associated with X_i. Define also, for k ∈ {0,...,m}, the subgraph G_k induced
by the edges associated with the k smallest p-values, and let S_k denote the
size of its largest connected component. Under the null hypothesis, the random
variables {P_i} are independent and uniformly distributed on [0,1]. Therefore,
S_k has the same distribution as C_k for all k ∈ {0,...,m}. Under the alternative
H_S, however, the distribution of the variables {P_i}_{i∈S} is altered, which induces
a deviation in the process {S_k}_{0≤k≤m} with respect to the normal percolation
process. Our test aims to detect this deviation.
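
As an illustration of this mapping, the sketch below turns a signal into the edge ordering that drives the percolation process. It assumes, purely for the example, that the null distribution F_0 is the standard normal; the function names are hypothetical.

```python
import math


def upper_tail_p_values(signal):
    """P_i = 1 - F_0(X_i), with F_0 assumed to be the standard normal CDF."""
    return [0.5 * math.erfc(x / math.sqrt(2.0)) for x in signal]


def edge_order_from_signal(signal):
    """Edge indices sorted by increasing p-value, i.e. most anomalous edges first."""
    p_values = upper_tail_p_values(signal)
    return sorted(range(len(signal)), key=lambda i: p_values[i])
```

The first k indices of the returned ordering are the edges that induce G_k, so feeding this ordering to any routine that tracks the largest connected component yields the sequence S_0, ..., S_m.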

Figure 1 illustrates the normal and anomalous behaviors of the percolation
process for three graph models: a two-dimensional square lattice, an Erdős-Rényi
random graph [13] and a Barabási-Albert preferential attachment graph [7].
For each model, a graph with 1024 vertices and approximately 2000 edges is
generated, and the mean and standard deviation of the fraction of vertices in the
largest connected component for each value of p are estimated using 10000 Monte
Carlo simulations. Then, for each graph, we generate a subtree S containing a
fraction δ of the vertices, assign to each edge e an independent Gaussian random
variable X_e ~ N(μ 1{e ∈ S}, 1) and compute the associated sample path of the
percolation process. This experiment was repeated 1000 times for each graph,
and the mean sample path for different values of δ and μ is displayed. The two
regimes of the percolation process can be observed, and the shape and location
of the phase transition both clearly depend on the graph model. While the
separation between the two regimes is quite clear for the lattice and the
Erdős-Rényi graph, it is much blurrier for the Barabási-Albert model, which yields more
complex structures, most interestingly heavy-tailed degree distributions. Since
such properties are often found in real-world networks, it is important to assess
their impact on the feasibility of percolation-based cluster detection. Figure 1
shows that although the anomalous sample paths become harder to distinguish
as the phase transition gets hazier, the normal trajectories are concentrated
enough to make even small deviations visible, which motivates our approach.

Fig. 1. Evolution of the fraction of vertices in the largest connected component as
p varies from 0 to 1, under H_0 and various alternatives, for three kinds of graphs:
a two-dimensional square lattice (left), an Erdős-Rényi random graph (center) and a
Barabási-Albert preferential attachment graph (right).


3.3 Putting It All Together — Description of Our Test 


We now proceed with the description of our test. First, define

$$K = \min\{k \le m : \mathbb{E}_0[S_k] > \sqrt{|V|}\},$$

where E_0 denotes the expected value under H_0. K can be understood as the
index corresponding to the onset of the phase transition. Since we aim to detect
the appearance of an unexpectedly large connected component in the early steps
of the percolation process, the test statistic we use is

$$\chi = \frac{1}{K} \sum_{k=1}^{K} S_k.$$

This statistic is equivalent to the area under a piecewise constant interpolation
of the sequence of points {(k, S_k)}_{0≤k≤K}, and is therefore expected to be higher
than usual in the presence of an anomalous subgraph.

Estimation of K and calibration of the test are both done through Monte 
Carlo simulation: using the Newman-Ziff algorithm [23], N random sample paths
of the imbedded Markov chain are computed. Let {S_k^{(i)}}_{0≤k≤m} denote the
trajectory of the largest connected component’s size for the ith realization of the
process. We get the following estimates:

$$\hat{K} = \min\Big\{k \le m : \frac{1}{N} \sum_{i=1}^{N} S_k^{(i)} > \sqrt{|V|}\Big\}, \qquad \hat{\chi} = \frac{1}{\hat{K}} \sum_{k=1}^{\hat{K}} S_k.$$

Finally, the empirical p-value can be expressed as

$$\hat{p} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{\hat{\chi}^{(i)} \ge \hat{\chi}\}, \quad \text{where } \hat{\chi}^{(i)} = \frac{1}{\hat{K}} \sum_{k=1}^{\hat{K}} S_k^{(i)} \text{ for } i \in \{1, \dots, N\}.$$
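
The calibration step can be sketched as follows, given precomputed trajectories. This is an illustrative reading of the procedure, not the authors' code: it assumes that the onset index is the first step at which the mean null trajectory exceeds the square root of the number of vertices, that the statistic is the average of the first K values, and that the p-value is the plain fraction of simulated statistics at least as large as the observed one. The function name, and the fallback to k = m when the threshold is never crossed, are hypothetical choices.

```python
import math


def calibrate_and_test(null_paths, observed_path, n_vertices):
    """null_paths: N simulated trajectories [S_0, ..., S_m]; observed_path: same for the data."""
    n_sim = len(null_paths)
    m = len(observed_path) - 1
    threshold = math.sqrt(n_vertices)

    # Onset index: first k >= 1 where the mean null trajectory exceeds sqrt(|V|).
    k_hat = m  # fallback (assumption) if the threshold is never exceeded
    for k in range(1, m + 1):
        mean_sk = sum(path[k] for path in null_paths) / n_sim
        if mean_sk > threshold:
            k_hat = k
            break

    def statistic(path):
        # Average of S_1, ..., S_{k_hat}.
        return sum(path[1:k_hat + 1]) / k_hat

    chi_obs = statistic(observed_path)
    chi_null = [statistic(path) for path in null_paths]
    # Fraction of simulated statistics at least as large as the observed one.
    p_value = sum(1 for c in chi_null if c >= chi_obs) / n_sim
    return k_hat, chi_obs, p_value
```

A small p-value indicates that the largest component grew faster in the early steps than in the simulated null trajectories.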


4 Experiments 


In order to assess the power of our test, we ran it on several synthetic graphs 
containing random anomalous trees. This section describes the procedure we 
used to generate the dataset, then presents our results and their interpretation. 


4.1 Generation of the Dataset 


The dataset is generated using the stochastic Kronecker graph model [21].
Kronecker graphs exhibit structural properties similar to those of real-world networks,
most importantly power law-distributed degrees and small diameter. Hence, this
model allows us to evaluate our test in a somewhat realistic setting.

Two parameter matrices are used: Θ_1 = [0.9 0.5; 0.5 0.3] (core-periphery
network) and Θ_2 = [0.9 0.2; 0.2 0.9] (hierarchical network). For a given matrix and
for i ∈ {12,13,14,15}, we generate an undirected graph through i iterations of
the Kronecker product, and only the largest connected component of this graph
is kept in order to obtain a connected network with approximately 2^i vertices.
Using this procedure, 10 graphs are generated for each pair of parameters (Θ, i).
Thus, we evaluate our test on graphs with sizes ranging from a few thousand
to a few tens of thousands of vertices, which covers a wide scope of potential use
cases. For each synthetic graph, anomalies are then generated as follows: given
δ ∈ (0,1), a random subtree S containing a fraction δ of the vertices is drawn.
Then, a random observation X_e ~ N(μ 1{e ∈ S}, 1) is independently drawn
for each edge e of the graph (where μ is a fixed signal strength). For a given
graph and a pair of parameters (δ, μ), 1000 anomalous signals X = (X_1,...,X_m)
are generated. 1000 signals are also drawn from the null distribution (that is,
X ~ N(0, I_m), where I_m is the m x m identity matrix) for comparison. Finally,
for each graph, the null distribution of the test statistic is estimated using 10000
random realizations of the percolation process. Using the obtained histogram,
the empirical p-values associated with the normal and anomalous samples are
derived, and we construct the Receiver Operating Characteristic (ROC) curve
for each pair (δ, μ). This procedure exposes the influence of various parameters
on the performance of our test, namely the graph size, the generator matrix, the
size δ of the anomalous region and the signal strength μ.
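
The generation steps can be sketched on a small scale as follows. This is a simplified illustration under stated assumptions: the edge-probability matrix is built explicitly as a Kronecker power (only practical for small numbers of iterations), each vertex pair is sampled independently from the upper triangle, and the set of anomalous edges is taken as given. Extraction of the largest connected component and the sampling of the random subtree are omitted, and all function names are hypothetical.

```python
import random


def kronecker_power(theta, iterations):
    """Edge-probability matrix given by the `iterations`-fold Kronecker power of theta."""
    probs = [[1.0]]
    for _ in range(iterations):
        # Kronecker product probs (x) theta: entry ((i,k),(j,l)) = probs[i][j] * theta[k][l].
        probs = [
            [a * b for a in row_p for b in row_t]
            for row_p in probs
            for row_t in theta
        ]
    return probs


def sample_undirected_graph(probs, rng):
    """Draw each pair u < v independently with probability probs[u][v] (no self-loops)."""
    n = len(probs)
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < probs[u][v]]


def draw_signal(edges, anomalous_edges, mu, rng):
    """X_e ~ N(mu * 1{e in S}, 1), drawn independently for each edge."""
    return [rng.gauss(mu if e in anomalous_edges else 0.0, 1.0) for e in edges]
```

Setting `anomalous_edges` to the empty set draws a signal from the null distribution N(0, I_m).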


4.2 Detectability Conditions: Empirical Study


Our results are displayed in Table 1 and Figs. 2 and 3. Our main interest is in
finding out which parameters have the strongest influence on the power of the 
test, and we provide some key observations and interpretations below. 


[Figure 2: ROC curves. Panel labels: δ = 0.001, δ = 0.005, δ = 0.01, δ = 0.05. Axis label: TPR.]

Fig. 2. Aggregated ROC curves of our test for 10 Kronecker graphs with initial matrix
Θ_1 = [0.9 0.5; 0.5 0.3], for several values of the number of Kronecker product iterations
i, the proportion δ of vertices in the anomalous tree and the signal strength μ.


Influence of the Graph Size. The first thing we notice in Figs. 2 and 3 is that for a 
given pair of parameters (δ, μ), the performance of the test consistently improves
as the size of the graph increases. One possible explanation for this comes from 
percolation theory: before the phase transition, the size of the largest connected 
component is sublinear in the size of the graph. This implies that, for a fixed 
ratio of vertices in the anomalous component, the difference between the size 
of the latter and the expected size of the largest component grows with the 
graph size. Therefore, the anomalous component becomes more visible as the 
graph grows. Note, however, that some structural properties of our synthetic 
graphs (e.g. density) might not remain identical for different values of i. It is 
thus difficult to pinpoint the actual influence of the sole number of vertices.


Fig. 3. Aggregated ROC curves of our test for 10 Kronecker graphs with initial matrix
Θ_2 = [0.9 0.2; 0.2 0.9], for several values of the number of Kronecker product iterations
i, the proportion δ of vertices in the anomalous tree and the signal strength μ.


Trade-Off Between δ and μ. As could be intuitively expected, our test performs
better for higher values of δ and μ. More interestingly, these two parameters
are intertwined: what makes an anomalous subgraph detectable is not only the
number of vertices it contains (which is controlled by δ), but also the presence
of a sufficient fraction of its edges among the most individually anomalous edges
of the graph (which is controlled by μ). In terms of experimental results, this
translates to poor performance when at least one of these parameters is too low.
However, there seems to be a range of values of δ and μ in which a decrease in
one can be made up for by an increase in the other. In particular, this implies
that even small subgraphs can be detected by our test as long as the signal is
strong enough. This is useful in “needle-in-a-haystack” scenarios such as network
intrusion detection, where the anomalies one looks for are often localized.


Influence of the Graph Structure. As evidenced by Fig. 1, structural properties 
of the graph heavily influence the normal behavior of the percolation process, in 
turn affecting the viability of percolation-based cluster detection. This explains 
the observable difference in detection power between the two kinds of graphs we 
consider. Further analysis shows that the generator Θ_1 yields more heavy-tailed
degree distributions, which is a plausible cause for the performance gap.


5 Discussion and Future Work


We now discuss the main properties of our test, identifying some limitations and 
providing leads for future work. 


Table 1. Aggregated AUC score of our test for 10 Kronecker graphs, using various
combinations of initial matrix Θ, number of iterations of the Kronecker product i,
proportion δ of vertices in the anomalous tree and signal strength μ.

       |       |                  Θ_1                  |                  Θ_2
 i     | μ     | δ=0.001 | δ=0.005 | δ=0.01 | δ=0.05 | δ=0.001 | δ=0.005 | δ=0.01 | δ=0.05
 12    | 1     | 0.502   | 0.510   | 0.525  | 0.591  | 0.502   | 0.527   | 0.582  | 0.796
 12    | 1.5   | 0.505   | 0.542   | 0.603  | 0.819  | 0.502   | 0.626   | 0.763  | 0.990
 12    | 2     | 0.503   | 0.628   | 0.769  | 0.981  | 0.505   | 0.785   | 0.949  | 1.000
 13    | 1     | 0.507   | 0.513   | 0.528  | 0.602  | 0.505   | 0.540   | 0.595  | 0.838
 13    | 1.5   | 0.513   | 0.560   | 0.631  | 0.847  | 0.512   | 0.694   | 0.848  | 0.998
 13    | 2     | 0.518   | 0.699   | 0.845  | 0.993  | 0.531   | 0.902   | 0.988  | 1.000
 14    | 1     | 0.503   | 0.515   | 0.525  | 0.596  | 0.503   | 0.550   | 0.614  | 0.867
 14    | 1.5   | 0.508   | 0.570   | 0.639  | 0.855  | 0.524   | 0.764   | 0.908  | 1.000
 14    | 2     | 0.528   | 0.752   | 0.887  | 0.997  | 0.590   | 0.969   | 0.998  | 1.000
 15    | 1     | 0.500   | 0.509   | 0.522  | 0.586  | 0.508   | 0.565   | 0.634  | 0.897
 15    | 1.5   | 0.511   | 0.584   | 0.645  | 0.861  | 0.555   | 0.840   | 0.955  | 1.000
 15    | 2     | 0.551   | 0.801   | 0.925  | 0.999  | 0.706   | 0.994   | 1.000  | 1.000


Theoretical Guarantees. From a theoretical perspective, our setting is more com- 
plex than that of [6]: we consider arbitrary networks instead of regular lattices, 
and our test statistic depends on the whole sample path of the percolation pro- 
cess rather than the marginal behavior at a given occupation probability. There- 
fore, the search for theoretical guarantees for our test was left out of the scope 
of this work, although it would certainly be of great interest. 


Computational Cost. The main advantage of our method is its computational 
efficiency. Indeed, computing the empirical p-value for a given graph and an 
observed signal only requires N + 1 runs of the Newman-Ziff algorithm, which 
has a very low cost. In contrast, a scan statistic-based test would require N + 1 
runs of a combinatorial optimization algorithm (one for the observed data and 
N additional runs to estimate the distribution of the test statistic under the 
null). Even with a very efficient optimization method, this is significantly more 
intensive. In terms of complexity, our test requires sorting the observations X;, 
running the Newman-Ziff algorithm N + 1 times, computing the mean sample 
path and the index K, and summing the first K values for each of the N + 1 
trajectories, resulting in O(m(log m + N)) operations. Note that the algorithm
can be further optimized using the fact that the test statistic depends only
on the first K steps of the percolation process. Although the exact value of K
depends on the graph, we empirically observe that it is generally smaller than the
number of vertices |V|. Therefore, early stopping of the Newman-Ziff algorithm
and partial sorting can reduce the complexity to O(m + |V|(N + log |V|)).
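
The two bounds can be recovered by adding up the cost of each step. The breakdown below is a sketch under two simplifying assumptions that are not stated in the text: each Newman-Ziff run is treated as linear in the number of edges processed (ignoring the inverse-Ackermann factor of union-find), and partial sorting is taken to mean selecting the K smallest p-values in linear time and then sorting them.

```latex
\begin{align*}
\underbrace{O(m \log m)}_{\text{sorting the } X_i}
 + \underbrace{O(m(N+1))}_{N+1 \text{ Newman-Ziff runs}}
 + \underbrace{O(mN)}_{\text{mean path and } \hat{K}}
 + \underbrace{O(K(N+1))}_{\text{summing } K \text{ values per path}}
 &= O\big(m(\log m + N)\big), \\
\underbrace{O(m + K \log K)}_{\text{partial sorting}}
 + \underbrace{O(K(N+1))}_{\text{runs stopped after } K \text{ steps}}
 &= O\big(m + |V|(N + \log |V|)\big) \quad \text{when } K \le |V|.
\end{align*}
```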


Detection Power. The expected downside of our method’s low computational 
cost is a loss in detection power. Our simulations show, however, that the pro- 
posed test can detect reasonably small anomalous subgraphs in large enough 
ambient graphs, which is our main goal here. Moreover, it does not rely on prior 
knowledge of the alternative distribution and can be used with only a rough 
estimate of F_0, which improves its usability in realistic settings.

Although the influence of some factors on the performance of the test was 
left out of the scope of this work, a wider analysis would be an interesting topic 
for future work. These factors include the density of the graph and the shape 
of the anomalous subgraph. More specifically, we only evaluated our test in the 
case of random anomalous trees, which provides general results but no insight 
into the influence of the diameter and the density of the anomalous subgraph. 


6 Conclusion 


By extending previous work on percolation-based cluster detection to a more 
general setting, we propose a computationally efficient test to detect an anoma- 
lous connected subgraph in an edge-weighted network. The underlying intuition 
is that it is often possible to find out whether such a subgraph is present with- 
out explicitly finding it: instead of enumerating all possible candidates, a much 
faster method can be obtained by looking for properties of the whole graph which 
are affected by the appearance of an anomalous cluster. Our work suggests that
percolation theory can provide such properties. 

Since it scales easily to large graphs and does not rely on extensive knowledge 
of the null and alternative distributions of the observed signal, we argue that 
our method is applicable to real-world problems. Moreover, we show through 
extensive simulations that its detection power remains acceptable, and that it 
can in particular detect small anomalous regions in large graphs. Therefore, we 
think the link between cluster detection and percolation theory deserves further 
exploration, from both a theoretical and an applied point of view.


Abstract. The majority of research on community detection in
attributed networks follows an “early fusion” approach, in which the
structural and attribute information about the network is integrated
to guide community detection. In this paper, we propose
an approach called late fusion, which looks at this problem from
a different perspective. We first exploit the network structure and node
attributes separately to produce two different partitionings. We then
combine these two sets of communities via a fusion algorithm, where
we introduce a parameter for weighting the importance given to each
type of information: node connections and attribute values. Extensive
experiments on various real and synthetic networks show that our late-fusion
approach can improve detection accuracy over using only network
structure. Moreover, our approach runs significantly faster than other
attributed community detection algorithms, including early-fusion ones.


Keywords: Community detection · Attributed networks · Late fusion


1 Introduction 


In many modern applications, data is represented in the form of relationships 
between nodes forming a network, or interchangeably a graph. A typical charac- 
teristic of these real networks is the community structure, where network nodes 
can be grouped into densely connected modules called communities. Community 
identification is an important issue because it can help to understand the net- 
work structure and leads to many substantial applications [6]. While traditional 
community detection methods focus on the network topology where communities 
can be defined as sets of nodes densely connected internally, recently, increasing 
attention has been paid to the attributes associated with the nodes in order 
to take into account homophily effects, and several works have been devoted 
to community detection in attributed networks. The aim of such a process is to
obtain a partitioning of the nodes where vertices belonging to the same subgroup
are densely connected and homogeneous in terms of attribute values.

In this paper, we propose a new method designed for community detection
in attributed networks, called late fusion. This is a two-step approach where we
first identify two sets of communities based on the network topology and node
attributes respectively, then we merge them together to produce the final
partitioning of the network that exhibits the homophily effect, according to which
linked nodes are more likely to share the same attribute values. The communities
based upon the network topology are obtained by simply applying an
existing algorithm such as Louvain [2]. For graphs whose node attributes are
numeric, we utilize existing clustering algorithms to get the communities (i.e.,
clusters) based on node attributes. We extend to binary-attributed graphs by
generating a virtual graph from the attribute similarities between the nodes, and
performing traditional community detection on the virtual graph. Although the
method is simple, extensive experiments show that it can be
competitive in terms of both accuracy and efficiency when compared against
other algorithms. Our main contributions are:


1. A new late-fusion approach to community detection in attributed networks, 
which allows the use of traditional methods as well as the integration of 
personal preference or prior knowledge. 

2. A novel method to identify communities that reflect attribute similarity for 
networks with binary attributes. 

3. Extensive experiments to validate the proposed method in terms of accuracy 
and efficiency. 


The rest of the paper is organized as follows. In Sect. 2, we provide a brief
review of community detection algorithms suited for attributed networks. We then
present our late-fusion approach in Sect. 3. Experiments illustrating the
effectiveness of the proposed method are detailed in Sect. 4. Finally, we summarize
our work and point out several future directions in Sect. 5.


2 Related Work 


How to incorporate the node attribute information into the process of network 
community detection has been studied for a long time. One of the early ideas 
is to transform attribute similarities into edge weights. For example, [13]
proposes the matching coefficient, which is the count of shared attributes between two
connected nodes in a network; [15] extends the matching coefficient to networks
with numeric node attributes; [4] defines edge weights based on self-organizing
maps. A drawback of these methods is that new edge weights are only
applicable to edges that already exist, hence the attribute information is not fully
utilized. To overcome this issue, a different approach is to augment the original
graph by adding virtual edges and/or nodes based on node attribute values. For
instance, [14] generates content edges based on the cosine similarity between
node attribute vectors, in graphs where nodes are textual documents and the
corresponding attribute vector is the TF-IDF vector describing their content.
The kNN-enhance algorithm [9] adds directed virtual edges from a node to one
of its k-nearest neighbors if their attributes are similar. The SA-Clustering algorithm [17]
adds both virtual nodes and edges to the original graph, where the virtual nodes
represent binary-valued attributes, and the virtual edges connect the real nodes
to the virtual nodes representing the attributes that the real nodes own.

Another class of methods is inspired by the modularity measure. These meth- 
ods incorporate attribute information into an optimization objective like the 
modularity. [5] injects an attribute-based similarity measure into the modularity
function; [1] combines the gain in the modularity with multiple common
users’ attributes as an integrated objective; the I-Louvain algorithm [3] proposes
an inertia-based modularity to describe the similarity between nodes with numeric
attributes, and adds the inertia-based modularity to the original modularity
formula to form the new optimization objective.

With the spread of deep learning, network representation learning
and node embedding (e.g. [8]) motivated new solutions. [12] proposes an
embedding-based community detection algorithm that applies representation learning
of graphs to learn a feature representation of a network structure, which is
combined with node attributes to form a cost function. Minimizing this cost function
yields the optimal community membership matrix.

Probabilistic models can be used to depict the relationship between node 
connections, attributes, and community membership. The task of community 
detection is thus converted to inferring the community assignment of the nodes. 
A representative of this kind is the CESNA algorithm [16], which builds a gen- 
erative graphical model for inferring the community memberships. 

Whereas the majority of the previous methods exploit simultaneously both 
types of information, we propose the late-fusion approach that combines two 
sets of communities obtained separately and independently from the network 
structure and node attributes via a fusion algorithm.


3 The Late-Fusion Method 


Given an attributed network G = (V, E, A), with V being the set of m nodes,
E the set of n edges, and A an m x r attribute matrix describing the attribute
values of the nodes with r attributes, the goal is to build a partitioning P =
{C_1, ..., C_k} of V into k communities such that nodes in the same community
are densely connected and similar in terms of attributes, whereas nodes from
distinct communities are loosely connected and different in terms of attributes.

For networks with numeric attributes, we can directly apply a community
detection algorithm F_s on G to identify a set of communities based on node
connections P_s = {C_1, C_2, ..., C_{k_s}}, and a clustering algorithm F_a on A to find
a set of clusters based on node attributes P_a = {C_1, C_2, ..., C_{k_a}}. When it comes
to binary attributed networks, traditional clustering algorithms are no longer
applicable, so we instead build a virtual graph G_a that shares the same node set as G,
but in which there is an edge only between two nodes that are similar enough in terms of
attributes. Then we apply F_s on G_a and obtain P_a. Note that we omit
categorical attributes since categorical values can be easily converted to the binary
case.

The second step is to combine the partitions P_s and P_a. We first derive the
adjacency matrices D_s and D_a from P_s and P_a respectively, where d_ij = 1 when
nodes i and j are in the same community in a partitioning P and d_ij = 0
otherwise. Next, an integrated adjacency matrix D is given by D = αD_s + (1 − α)D_a.
Here α is the weighting parameter that balances the relative strength of network
topology and node attributes. In this way, the information about network
topology and node attributes of the original graph G is represented in D. Now G_int,
derived from the adjacency matrix D, is an integrated, virtual, weighted graph
whose edges embody the homophily effect of G. Algorithm 1 shows the steps of
our late-fusion approach applied to networks with binary attributes.


Algorithm 1. Late-fusion on networks with binary attributes
Input: G = (V, E, A), F_s, α
Output: P = {C_1, C_2, ..., C_k}

P_s = F_s(G_s)
G_a = build_virtual_graph(A)
P_a = F_s(G_a)
D_s = get_adjacency_matrix(P_s), D_a = get_adjacency_matrix(P_a)
D = αD_s + (1 − α)D_a
G_integrated = from_adjacency_matrix(D)
P = F_s(G_integrated)
return P
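
The fusion step of the algorithm can be sketched in a few lines. This is an illustrative sketch, not the released implementation: partitions are assumed to be given as lists of community labels (one per node), the function names are hypothetical, and the final community detection routine is passed in as a callable named `detect` that accepts a weighted adjacency matrix. The diagonal entries are left at 1, following the co-membership definition literally.

```python
def co_membership_matrix(labels):
    """d_ij = 1 if nodes i and j carry the same community label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]


def fuse_partitions(labels_s, labels_a, alpha):
    """Integrated adjacency matrix D = alpha * D_s + (1 - alpha) * D_a."""
    d_s = co_membership_matrix(labels_s)
    d_a = co_membership_matrix(labels_a)
    n = len(labels_s)
    return [[alpha * d_s[i][j] + (1 - alpha) * d_a[i][j] for j in range(n)]
            for i in range(n)]


def late_fusion(labels_s, labels_a, alpha, detect):
    """Run a weighted community detection routine `detect` on the fused graph."""
    return detect(fuse_partitions(labels_s, labels_a, alpha))
```

With alpha = 0.5, a pair of nodes grouped together by both partitions gets weight 1, a pair grouped together by only one of them gets weight 0.5, and a pair separated by both gets weight 0.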


Here we address an important detail: how to build the virtual graph G_a from
the node-attribute matrix A. We compute the inner product as the similarity
measure between each node pair, and if the inner product exceeds a
predetermined threshold, we regard the nodes as similar and add a virtual edge between
them. The threshold can be determined heuristically based on the distribution of
the node similarities. However, it should be chosen so that
the resulting G_a is neither too dense nor too sparse, since either case
could harm the quality of the final communities. With this in mind, we put
forward two thresholding approaches:


1. Median thresholding (MT): Suppose S is the m x m similarity matrix
of all nodes in V. We take all the off-diagonal, upper triangular (or lower
triangular) entries of S, find the median of these numbers and set it as the
threshold. This approach guarantees that we add virtual edges between the half of all
node pairs whose similarity values are higher than those of the other half.

2. Equal-edge thresholding (EET): We compute q = 1 − d(G) where d(G)
is the density of G. Then the q-th quantile of the similarity distribution is the
chosen threshold. In this approach, we let the original graph G_s be the proxy
that decides how we construct the virtual graph G_a.
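
Both thresholding rules can be sketched as follows. The sketch makes assumptions the text does not spell out: the density of an undirected graph is taken as the number of edges divided by the number of node pairs, the quantile is computed with a simple nearest-rank rule, and a virtual edge is added only when the similarity strictly exceeds the threshold. All function names are hypothetical.

```python
import math
import statistics


def pairwise_similarities(attributes):
    """Inner products between binary attribute vectors, for all pairs i < j."""
    n = len(attributes)
    return {
        (i, j): sum(a * b for a, b in zip(attributes[i], attributes[j]))
        for i in range(n) for j in range(i + 1, n)
    }


def median_threshold(sims):
    """MT: median of the off-diagonal similarity values."""
    return statistics.median(sims.values())


def equal_edge_threshold(sims, n_nodes, n_edges):
    """EET: q-th quantile of the similarities, with q = 1 - density of the original graph."""
    density = 2.0 * n_edges / (n_nodes * (n_nodes - 1))  # assumed definition of d(G)
    q = 1.0 - density
    ordered = sorted(sims.values())
    rank = max(1, math.ceil(q * len(ordered)))  # nearest-rank quantile (assumption)
    return ordered[rank - 1]


def build_virtual_graph(sims, threshold):
    """Virtual edges between node pairs whose similarity exceeds the threshold."""
    return [pair for pair, s in sims.items() if s > threshold]
```

With binary attributes the similarity values are small integers, so many pairs can tie at the threshold; in that case the number of virtual edges may be noticeably smaller than the targeted fraction of pairs.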


Fig. 1. Node attribute distribution for three groups of experiments. (a) Strong
attributes, (b) Medium attributes, (c) Weak attributes. Each color represents a unique
community (Color figure online)


4 Experiments 


Our proposed method has been evaluated through experiments on multiple syn- 
thetic and real networks and results are presented in this section. For networks 
with numeric attributes, we take advantage of existing clustering algorithms to 
obtain communities based on attributes (i.e., clusters), and for networks with 
binary attributes, we employ Algorithm 1 to perform community detection. We 
have also released our code so that readers can reproduce the results¹.


4.1 Synthetic Networks with Numeric Attributes 


Data. We use an attributed graph generator [10] to create three attributed
graphs with ground-truth communities, denoted as G_strong, G_medium and G_weak,
indicating that the corresponding ground-truth partitionings are strong, medium, and
weak in terms of modularity Q. To examine the effect of attributes on community
detection, for each of G_strong, G_medium and G_weak, we assign three different
attribute distributions as shown in Fig. 1, where attributes in Fig. 1a and b are
generated from a Gaussian mixture model with a shared standard deviation, and
Fig. 1c presents the original attributes generated by [10]. In this way, for each
graph having a specific community structure (G_strong, G_medium, G_weak), we
also have three types of attributes, denoted strong attributes, medium attributes and
weak attributes, yielding 9 datasets in total.


Evaluation Measures and Baselines. Normalized Mutual Information
(NMI), Adjusted Rand Index (ARI), and running time are used to evaluate
algorithm accuracy and efficiency. Louvain [2] and SIWO [7] have been chosen as
baseline algorithms that utilize only the links to identify network communities.


1 https://github.com/changliu94/attributed-community-detection. 

Table 1. Properties of synthetic networks

         | m    | n    | k  | r | Q
Gstrong  | 2000 | 7430 | 10 | 2 | 0.81
Gmedium  | 2000 | 7445 | 10 | 2 | 0.65
Gweak    | 2000 | 6988 | 10 | 2 | 0.54


Table 2. Properties of Sina Weibo network

m    | n     | k  | r  | Q    | within-inertia ratio
3490 | 30282 | 10 | 10 | 0.05 | 0.04


Note that since the attribute distribution does not affect Louvain and SIWO,
the results of Louvain and SIWO are only presented in Table 3. We choose
Spectral Clustering (SC) and DBSCAN as two representative clustering algorithms
as they both can handle non-flat geometry. We treat the number of clusters as a
known input parameter of SC, and the neighborhood size of DBSCAN is set to
the average node degree. We adopt default values of the remaining parameters
from the scikit-learn implementation of these two algorithms. Finally, we take
the implementation of the I-Louvain algorithm, which exploits links and attribute
values, as our contender. The code of I-Louvain is available online². Given
Louvain, SIWO, SC, and DBSCAN, we obtain four combinations
for our late-fusion method. In all experiments, the α parameter in Algorithm 1
is set to 0.5, i.e., the same weight is allocated to structural and attribute
information.


Table 3. Results of strong attributes, time is measured in seconds

                 | Gstrong            | Gmedium            | Gweak
                 | NMI  | ARI  | Time | NMI  | ARI  | Time | NMI  | ARI  | Time
Louvain          | .795 | .797 | 0.41 | .695 | .686 | 0.49 | .665 | .674 | 0.64
SIWO             | .836 | .850 | 0.97 | .702 | .705 | 1.09 | .504 | .458 | 0.98
SC               | .802 | .713 | 1.15 | .777 | .677 | 0.64 | .768 | .669 | 0.68
DBSCAN           | .469 | .103 | 0.06 | .434 | .083 | 0.06 | .465 | .102 | 0.24
I-Louvain        | .515 | .150 | 39.2 | .718 | .704 | 30.0 | .608 | .503 | 37.6
Louvain + SC     | .824 | .704 | 7.34 | .784 | .618 | 5.74 | .765 | .597 | 7.14
Louvain + DBSCAN | .818 | .813 | 8.64 | .730 | .702 | 8.87 | .704 | .690 | 10.6
SIWO + SC        | .844 | .738 | 10.3 | .786 | .636 | 7.33 | .723 | .508 | 6.46
SIWO + DBSCAN    | .818 | .813 | 11.7 | .730 | .702 | 10.2 | .704 | .690 | 11.6


2 https://www.dropbox.com/sh/j4aqitujiaifgq4/AAAAHOL3uUIPYNWKoLpcAh0TPa.

Table 4. Results of medium attributes, time is measured in seconds

                 | Gstrong            | Gmedium            | Gweak
                 | NMI  | ARI  | Time | NMI  | ARI  | Time | NMI  | ARI  | Time
SC               | .529 | .338 | 0.83 | .522 | .322 | 0.53 | .538 | .349 | 0.57
DBSCAN           | .096 | .012 | 0.08 | .066 | .008 | 0.14 | .065 | .011 | 0.09
I-Louvain        | .517 | .150 | 36.8 | .707 | .690 | 33.7 | .614 | .522 | 33.2
Louvain + SC     | .734 | .450 | 5.62 | .696 | .390 | 5.96 | .677 | .392 | 5.66
Louvain + DBSCAN | .755 | .726 | 9.20 | .670 | .636 | 11.9 | .641 | .633 | 13.6
SIWO + SC        | .748 | .469 | 12.7 | .699 | .402 | 7.12 | .625 | .335 | 7.44
SIWO + DBSCAN    | .744 | .726 | 8.73 | .670 | .636 | 8.98 | .641 | .633 | 12.4


Results. Table 3, corresponding to strong attributes, shows that late fusion is
the best-performing algorithm in terms of NMI on Gstrong and Gmedium, and
very close to SC on Gweak (0.765 against 0.768), whereas it is better in terms of
ARI on this last graph. In Tables 4 and 5, corresponding respectively to medium
and weak attributes, the accuracy of late fusion degrades as the attribute quality
deteriorates, but it remains consistently high compared to I-Louvain and the
clustering algorithms. Moreover, thanks to the complementary structural
information, the performance of the late-fusion methods is less sensitive to the
deterioration of community quality than that of the clustering algorithms. As for
the running time, the classic community detection algorithms Louvain and SIWO
are, as expected, the fastest algorithms, since they do not consider node
attributes, but the late-fusion method is still considerably faster than I-Louvain
(between 5.62 and 13.6 s against 30.0 to 39.5 s across Tables 3, 4 and 5).


Table 5. Results of weak attributes, time is measured in seconds

                 | Gstrong            | Gmedium            | Gweak
                 | NMI  | ARI  | Time | NMI  | ARI  | Time | NMI  | ARI  | Time
SC               | .483 | .270 | 3.31 | .514 | .307 | 2.32 | .489 | .276 | 2.45
DBSCAN           | .000 | .000 | 0.06 | .000 | .000 | 0.06 | .000 | .000 | 0.14
I-Louvain        | .517 | .150 | 35.1 | .707 | .690 | 34.3 | .614 | .522 | 39.5
Louvain + SC     | .770 | .670 | 11.8 | .705 | .613 | 10.2 | .689 | .564 | 9.33
Louvain + DBSCAN | .795 | .797 | 11.2 | .695 | .685 | 10.4 | .667 | .674 | 12.9
SIWO + SC        | .797 | .703 | 13.2 | .709 | .635 | 12.3 | .601 | .467 | 11.0
SIWO + DBSCAN    | .795 | .797 | 11.6 | .695 | .685 | 11.3 | .667 | .674 | 12.6


4.2 Real Network with Numeric Attributes


Data and Baselines. Sina Weibo³ is the largest online Chinese micro-blog
social networking website. Table 2 shows the corresponding properties of the
Sina Weibo network built by [9]⁴. It includes the within-inertia ratio J, a measure
of attribute homogeneity of data points that are assigned to the same subgroup.
The lower the within-inertia ratio, the more similar the nodes in the same
community are. As the DBSCAN algorithm performs poorly on the Sina Weibo network
and it is costly to infer a good combination of its hyper-parameters,
it has been replaced by k-means as a supplement to spectral clustering.
The number of clusters required as an input by k-means and SC is inferred with
the 'elbow method', which yields 10, the actual number of clusters.
Moreover, since we have the prior knowledge that the ground-truth communities
are based on the topics of the forums from which those users are gathered, we
reckon that the formation of communities depends more on the attribute values
than on the structure and set the parameter α to 0.2.


Results. Table 6 presents the results on the Sina Weibo network. The two baseline
algorithms Louvain and SIWO and the contending algorithm I-Louvain perform
poorly on the Sina Weibo network, whereas the clustering algorithms show a high
accuracy. In particular, the k-means algorithm and the late-fusion methods that
combine it with Louvain or SIWO, with the emphasis on attribute information,
produce the best NMI and ARI (0.649 and 0.579). This is because the modularity of
the Sina Weibo network is low (0.05 as indicated in Table 2) and the within-inertia
ratio is also low (0.04). The results also validate our assumption that communities
in this network are mainly determined by the attributes. We will further explore
the effect of α in Sect. 4.4.


Table 6. Experimental results on Sina Weibo network

                | NMI  | ARI  | Time
Louvain         | .232 | .197 | 1.98
SIWO            | .040 | .000 | 3.26
SC              | .612 | .520 | 3.16
k-means         | .649 | .579 | 0.25
I-Louvain       | .204 | .038 | 261
Louvain+SC      | .611 | .519 | 48.9
Louvain+k-means | .649 | .579 | 42.1
SIWO+SC         | .611 | .519 | 37.9
SIWO+k-means    | .649 | .579 | 50.4


Table 7. Properties of Facebook networks

Network ID | m    | n     | k  | r   | Q
0          | 347  | 5038  | 24 | 224 | 0.179
107        | 1045 | 53498 | 9  | 576 | 0.218
348        | 227  | 6384  | 14 | 161 | 0.210
414        | 159  | 3386  | 7  | 105 | 0.468
686        | 170  | 3312  | 14 | 63  | 0.101
698        | 66   | 540   | 13 | 48  | 0.239
1684       | 792  | 28048 | 17 | 319 | 0.509
1912       | 755  | 60050 | 46 | 480 | 0.339
3437       | 547  | 9626  | 32 | 262 | 0.026
3980       | 59   | 292   | 17 | 42  | 0.242


3 http://www.weibo.com. 
4 This dataset is available online https://github.com/smileyan448/Sinanet. 

4.3 Real Network with Binary Attributes


Data. The Facebook dataset [11] contains 10 egocentric networks with
ground-truth communities and binary attributes that correspond to anonymized
user information such as name, work, and education. This dataset is available
online⁵ and Table 7 presents the properties of these networks.

We still treat Louvain and SIWO as our baselines. We use the CESNA
algorithm [16], which is able to handle binary attributes in addition to the links, as our
contender⁶. To compare the two thresholding strategies proposed in Sect. 3,
we present experimental results of four late-fusion methods: Louvain +
equal-edge thresholding (denoted as Louvain-EET), Louvain + median thresholding
(denoted as Louvain-MT), SIWO + equal-edge thresholding (denoted as
SIWO-EET), and SIWO + median thresholding (denoted as SIWO-MT). We set α to
its default value 0.5.


Table 8. NMI of different community detection results on Facebook network

Network ID  | 0    | 107  | 348  | 414  | 686  | 698  | 1684 | 1912 | 3437 | 3980 | Average
Louvain     | .382 | .332 | .478 | .609 | .284 | .281 | .047 | .565 | .181 | .729 | .389
SIWO        | .390 | .363 | .375 | .586 | .215 | .259 | .053 | .557 | .174 | .605 | .358
CESNA       | .263 | .249 | .307 | .586 | .238 | .564 | .438 | .450 | .176 | .552 | .382
Louvain-EET | .558 | .355 | .525 | .538 | .463 | .669 | .462 | .511 | .310 | .704 | .509
Louvain-MT  | .452 | .341 | .489 | .556 | .351 | .479 | .323 | .491 | .262 | .696 | .444
SIWO-EET    | .541 | .364 | .452 | .531 | .406 | .630 | .460 | .509 | .310 | .648 | .485
SIWO-MT     | .431 | .353 | .405 | .538 | .252 | .406 | .332 | .491 | .260 | .588 | .406


Table 9. ARI of different community detection results on Facebook network

Network ID  | 0    | 107  | 348  | 414  | 686  | 698  | 1684 | 1912 | 3437 | 3980 | Average
Louvain     | .143 | .148 | .303 | .558 | .110 | .000 | .000 | .461 | .000 | .398 | .209
SIWO        | .220 | .177 | .127 | .519 | .000 | .009 | .000 | .419 | .002 | .209 | .167
CESNA       | .073 | .097 | .156 | .480 | .001 | .202 | .310 | .361 | .014 | .067 | .176
Louvain-EET | .024 | .047 | .103 | .265 | .006 | .000 | .043 | .252 | .000 | .069 | .008
Louvain-MT  | .061 | .079 | .129 | .413 | .063 | .000 | .048 | .235 | .000 | .084 | .110
SIWO-EET    | .043 | .045 | .124 | .252 | .003 | .000 | .057 | .235 | .000 | .095 | .009
SIWO-MT     | .108 | .079 | .141 | .391 | .040 | .016 | .060 | .223 | .000 | .073 | .113


5 http://snap.stanford.edu/data. 
6 The source code of CESNA is available online https://github.com/snap-stanford/snap/tree/master/examples/cesna.

Table 10. Running time of different community detection results on Facebook network,
measured in seconds

Network ID  | 0    | 107  | 348  | 414  | 686  | 698  | 1684 | 1912 | 3437 | 3980 | Average
Louvain     | 0.15 | 1.83 | 0.12 | 0.06 | 0.09 | 0.02 | 0.80 | 1.28 | 0.31 | 0.01 | 0.47
SIWO        | 0.34 | 3.78 | 0.31 | 0.16 | 0.17 | 0.03 | 1.46 | 3.79 | 0.51 | 0.02 | 1.06
CESNA       | 9.76 | 103  | 6.02 | 2.47 | 3.12 | 0.63 | 38.3 | 22.9 | 21.1 | 0.60 | 20.8
Louvain-EET | 0.72 | 4.68 | 0.40 | 0.25 | 0.24 | 0.07 | 1.95 | 3.83 | 0.78 | 0.03 | 1.30
Louvain-MT  | 2.90 | 20.0 | 0.82 | 0.48 | 0.44 | 0.08 | 8.22 | 9.41 | 3.28 | 0.06 | 4.57
SIWO-EET    | 1.73 | 24.4 | 2.87 | 0.68 | 0.76 | 0.14 | 5.76 | 28.5 | 4.26 | 0.12 | 6.92
SIWO-MT     | 9.45 | 91.4 | 5.27 | 1.73 | 3.14 | 0.34 | 44.9 | 43.4 | 13.5 | 0.17 | 21.3


Results. Results in terms of NMI, ARI, and running time are presented in
Tables 8, 9, and 10, respectively. In terms of NMI, the results in Table 8 show again
that our late-fusion algorithms can substantially improve the community
detection accuracy over Louvain. On average, the late-fusion method Louvain-EET
outperforms Louvain, SIWO, and CESNA by 30.8%, 42.2%, and 33.2%,
respectively. The late-fusion method Louvain-MT outperforms the three by 14.1%,
24.0%, and 16.2%, respectively. However, all of the late-fusion methods perform
poorly when evaluated by ARI. This results from the goal of our late-fusion
approach. Recall that we aim to find the set of communities such that nodes
in the same community are densely connected and similar in terms of attributes,
whereas nodes residing in different communities are loosely connected and
dissimilar in attributes. This goal leads the late-fusion approach to over-partition
communities that are formed by only one of the two sources of information. The
over-partitioning greatly hurts the ARI results. A postprocessing model to
resolve the over-partitioning issue with late fusion is left as future work. The
running time results shown in Table 10 again demonstrate the efficiency advantage
of our late-fusion methods over CESNA, with the exception of SIWO-MT, whose
average running time (21.3 s) is comparable to that of CESNA (20.8 s).
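
To make the effect of over-partitioning on the two measures concrete, the following sketch computes ARI and NMI for a toy case in which each of two ground-truth communities is split in half. The implementation is a plain textbook version written for this illustration; in particular, normalizing NMI by the arithmetic mean of the two entropies is an assumption, since the text does not state which NMI variant was used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """ARI computed from the contingency table of two labelings."""
    n = len(truth)
    pairs_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    maximum = (pairs_truth + pairs_pred) / 2
    return (pairs_both - expected) / (maximum - expected)


def entropy(labels):
    """Shannon entropy (natural logarithm) of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(truth, pred):
    """NMI normalized by the arithmetic mean of the two entropies (assumed variant)."""
    n = len(truth)
    count_t, count_p = Counter(truth), Counter(pred)
    mutual = sum(
        c / n * log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in Counter(zip(truth, pred)).items()
    )
    return mutual / ((entropy(truth) + entropy(pred)) / 2)


# Two ground-truth communities of 6 nodes, each split into two halves.
truth = [0] * 6 + [1] * 6
over_partitioned = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
nmi = normalized_mutual_information(truth, over_partitioned)
ari = adjusted_rand_index(truth, over_partitioned)
```

Even though every predicted community is pure with respect to the ground truth, ARI falls well below NMI in this toy case, which matches the pattern seen in Tables 8 and 9.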


4.4 Effect of Parameter α


In the Sina Weibo experiment, we saw the advantage of having a weighting
parameter to balance the strength of the two sources of information. In this
section, we look more closely at the effect of α on the community detection results.
To do so, we devise an experiment where we use the Gstrong and Gweak
introduced in Table 1. In reverse, we assign weak attributes to Gstrong and strong
attributes to Gweak. Then we perform our late-fusion algorithm on these two
graphs with varying α values. In our experiment, we choose SIWO as the
community detection algorithm and k-means as the clustering algorithm.

Table 11 presents the NMI and ARI of the late fusion with SIWO and
k-means when α varies. Gstrong has communities with a strong structure but weak
attributes, so both NMI and ARI go up as we put more weight
on the structure. On the contrary, Gweak has weak structural communities but
strong attributes, hence the accuracy decreases as α increases. One can also
notice that when α is sufficiently high or low, late fusion becomes equivalent to
using community detection or clustering only (for α ≥ 0.8, the scores on Gstrong,
0.836 and 0.850, coincide with those of SIWO alone in Table 3), which is consistent
with our observation in the Sina Weibo experiment.

Table 11. Effect of α

        | α = 0.0       | α = 0.2       | α = 0.5       | α = 0.8       | α = 1.0
        | NMI   | ARI   | NMI   | ARI   | NMI   | ARI   | NMI   | ARI   | NMI   | ARI
Gstrong | 0.530 | 0.359 | 0.530 | 0.359 | 0.756 | 0.513 | 0.836 | 0.850 | 0.836 | 0.850
Gweak   | 0.867 | 0.834 | 0.867 | 0.834 | 0.762 | 0.470 | 0.526 | 0.364 | 0.526 | 0.364

In practice, when network communities are mainly determined by the links,
α should be greater than 0.5; α < 0.5 is recommended if attributes play a
more important role in forming the communities. When prior knowledge about
network communities is unavailable or both sources of information contribute
equally, α should be 0.5.


4.5 Complexity of Late Fusion 


It is a known drawback of attributed community detection algorithms that they 
are very time-consuming due to the need to consider node attributes. Our late- 
fusion method tries to circumvent this problem by taking advantage of the exist- 
ing community detection and clustering algorithms that are efficiently optimized, 
and combining their results by a simple approach. To further show the computa- 
tional efficiency of our late-fusion method, we compute the running time of the 
late-fusion method and compare it with other methods. 


[Figure: running time in seconds against number of nodes (2000 to 10000) for Louvain, SIWO, Late Fusion and I-Louvain]

Fig. 2. Running time of Louvain, SIWO, late fusion and I-Louvain on networks of
different sizes

We test the running time of four different community detection methods on
graphs of five sizes, with 2000, 4000, 6000, 8000, and 10000 nodes.
These graphs are also generated by the attributed graph generator [10].
We keep the modularity of each graph in the range 0.64 to 0.66 and keep
other hyperparameters the same. For each size, we randomly sample 10 graphs
from the graph generator and plot the average running time of each method. As
we can see in Fig. 2, our late-fusion method is, as expected, slower
than the two community detection methods that only utilize node connections.
However, our algorithm runs much faster than the I-Louvain algorithm, although
the running time of both grows approximately linearly with the network size.


5 Conclusion and Future Direction 


In this paper, we proposed a new approach to the problem of community
detection in attributed networks that follows a late-fusion strategy. We showed with
extensive experiments that, most often, our late-fusion method is not only able
to improve the detection accuracy provided by traditional community
detection algorithms, but can also outperform the chosen contenders in terms of
both accuracy and efficiency. We learned that combining node connections with
attributes to detect communities of a network is not always the best solution:
when one side of the network properties is strong while the other
is weak, using only the best information available can lead to better detection
results. It is part of our future work to understand when and how we should use
the extra attribute information to help community detection. ARI suffers greatly
from the over-partitioning issue of our late fusion when applied to networks with
binary attributes. A postprocessing model to resolve this issue is desired. We also
hope to extend the late-fusion approach to networks with a hybrid of binary and
numeric attributes as well as networks with overlapping communities.


Abstract. In different application areas, the prediction of values that are
hierarchically related is required. As an example, consider predicting the revenue per
month and per year of a company where the prediction of the year should be
equal to the sum of the predictions of the months of that year. The idea of
reconciliation of prediction on grouped time-series has been previously proposed
to provide optimal forecasts based on such data. This method, in effect, models
the time-series collectively rather than providing a separate model for time-series
at each level. While the idea of reconciliation was originally applied to
time-series data, it is not clear whether such an approach is also applicable
to regression settings where multi-attribute data is available. In this paper,
we address this problem by proposing Reconciliation for Regression (R4R),
a two-step approach for prediction and reconciliation. In order to evaluate this
method, we test its applicability in the context of Travel Time Prediction (TTP)
of bus trips where two levels of values need to be calculated: (i) travel times of
the links between consecutive bus-stops; and (ii) total trip travel time. The results
show that R4R can improve the overall results in terms of both link TTP
performance and reconciliation between the sum of the link TTPs and the total trip
travel time. We compare the results acquired when using group-based
reconciliation methods and show that the proposed reconciliation approach in a regression
setting can provide better results in some cases. This method can be generalized
to other domains as well.


Keywords: Regression · Reconciliation · Bus travel time


1 Introduction 


Regression analysis provides a simple framework for predicting numerical target
attributes from a set of independent predictive attributes. Addressing any problem using
this framework requires designing models that fully capture the relations between
predictive and target attributes. This has so far led to many classes of regression models
being designed. For instance, multi-target regression models [11] consider predicting
the value of multiple target attributes as opposed to basic regression models that aim
at predicting only a single target attribute at a time. In another case, when one target
variable is being predicted from a set of hierarchically ordered predictive attributes, the
problem is known as multi-level regression [5].

© The Author(s) 2020

In this paper, we address the problem of regression for a class of problems where
dependent variables are additionally hierarchically organized following different levels
of aggregation. An example is the revenue forecasts per month and also per year of a
given company. The forecast for the new year can be the sum of the predictions made
for each of the twelve months of the new year or can be made directly for the full new
year. However, in many situations, it is important that the sum of the predictions per
month is equal to the prediction for the full year. This raises two questions.
Can we obtain better predictions using both predictions for all months
and for the full year? How may we reconcile the sum of the predictions made per month
with the prediction made for the full year? The authors of [8] answered these questions for
hierarchies of time series, i.e., sequences of values, typically equally spaced, where
each sequence can be aggregated by a given dimension.

This notion of hierarchy can also exist in the regression setting, i.e., a problem with
a set of $n$ instances $(X_i, y_i)$, $i = 1, \ldots, n$. Each $(X_i, y_i)$ instance has a vector $X_i$ with
$p$ predictive attributes $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and a quantitative target attribute $y_i$. The
hierarchy can exist in this regression setting when, for instance, two of the $p$ predictive
attributes have a 1-to-many relation as referred to in relational databases. Addressing
this problem in the regression setting leads to more flexible and robust solutions
compared to the time series approach because: (1) any number of observations per time
interval can be defined; (2) there are no limitations to the time interval between
consecutive observations; and (3) any other type of predictive attribute can be used to better
explain the target attribute.

In this work, we present an approach to reconcile predictions in the regression
setting. We achieve this by proposing a new method named Reconciling for Regression
(R4R). The R4R method is tested on the bus travel time prediction problem. This
problem considers that buses run on predefined routes, and each route is composed of several
links. Each link is the road stretch between two consecutive bus stops. Reconciling the
predictions in this problem means reconciling the sum of the predictions made for each
link with the prediction made for the full route. To the best of the authors' knowledge,
this is the first work on reconciling predictions in the regression setting. This work
also differs from the multi-target and multi-level variants in that it combines both
(having multiple targets that are hierarchically ordered).

The R4R method can be applied to any other regression problem which exhibits
a one-to-many relationship between instances, and where the aggregated target
value (the one) is the sum of the detailed target values (the many). In the previous
examples: (1) in the revenue forecast example, the many component targets are
the revenue forecasts per month, and the one component target is the revenue forecast
for the full year; (2) in the bus travel time example, the many component targets are the
link predictions while the one component target is the full route prediction. In this paper,
we only discuss the sum as aggregation criterion (the one should be equal to the sum
of the many), but the proposed method could be easily extended to other aggregation
criteria, e.g., the average.

The remainder of this paper is organized as follows. In Sect. 2, we present the
previous work on reconciling predictions. Section 3 elaborates the proposed methodology.
In Sect. 4, we describe the case study. The results of the case study are presented and
discussed in Sect. 5. Finally, the conclusions are presented in Sect. 6.


2 Literature Review 


In this section, we review the previous research, considering both (i) the methods for
forecasting hierarchically organized time-series data and (ii) the application area of
travel time prediction.


Methods for Forecasting Hierarchically Organized Data: Common methods used
to reconcile predictions for hierarchically organized time-series data can be
grouped into three categories: bottom-up, top-down and middle-out, based on the level
which is predicted first. Bottom-up strategies forecast all the low-level target attributes
and use the sum of these predictions as the forecast for the higher-level attribute. In
contrast, top-down approaches predict the top-level attribute and then split up the
predictions for the lower-level attributes based on historical proportions that may be
estimated. For time-series data with more than two levels of hierarchy, a middle-out
approach can be used, combining both bottom-up and top-down approaches [3]. These
methods form linear mappings from the initial predictions to reconciled estimates. As
a consequence, the sum of the forecasts of the components of a hierarchical time series
is equal to the forecast of the whole. However, this is achieved without guaranteeing
an optimal solution. The authors of [8] presented a new framework for optimally
reconciling forecasts of all series in a hierarchy to ensure they add up. The method first
computes the forecast independently for each level of the hierarchy. Afterward, the method
provides a means for optimally reconciling the base forecasts so that they aggregate
appropriately across the hierarchy. The optimal reconciliation is based on a
generalized least squares estimator and requires an estimation of the covariance matrix of the
reconciliation errors. Using Australian domestic tourism data, the authors of [8] compare
their optimal method with bottom-up and conventional top-down forecast approaches.
Results show that the optimal combinational approach and the bottom-up approach
outperform the top-down method. The same authors extended, in [9], the previous work
proposed in [8] to cover non-hierarchical groups of time series, as well as large groups
of time series data with a partial hierarchical structure. A new combinational
forecasting approach is proposed that incorporates the information from a full covariance
matrix of forecast errors in obtaining a set of aggregate forecasts. They use a weighted
least squares method due to the difficulty of estimating the covariance matrix for large
hierarchies.
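
The two simplest strategies above can be sketched for a two-level hierarchy as follows. The function names and the quarterly revenue figures are hypothetical and serve only to show that both strategies return forecasts that add up, while generally disagreeing with each other.

```python
def bottom_up(component_forecasts):
    """Aggregate forecast = sum of the independently forecast components."""
    return sum(component_forecasts)


def top_down(total_forecast, historical_components):
    """Split a top-level forecast using historical proportions."""
    history_total = sum(historical_components)
    return [total_forecast * h / history_total for h in historical_components]


# Hypothetical quarterly revenues (same unit throughout).
last_year = [20.0, 30.0, 25.0, 25.0]
quarter_forecasts = [22.0, 33.0, 27.0, 28.0]
year_forecast = 120.0  # forecast made directly at the top level

bu_year = bottom_up(quarter_forecasts)          # yearly value implied by the quarters
td_quarters = top_down(year_forecast, last_year)  # quarterly values implied by the year
```

Here the bottom-up yearly value (110) differs from the direct yearly forecast (120), and the top-down quarters differ from the direct quarterly forecasts. Each strategy simply discards one level of forecasts, which is the gap that optimal reconciliation methods such as [8] aim to close.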

In [16], an alternative representation that involves inverting only a single matrix of 
a lower number of dimensions is used. The new combinational forecasting approach 
incorporates the information from a full covariance matrix of forecast errors in obtain- 
ing a set of aggregate consistent forecasts. The approach minimizes the mean squared 
error of the aggregate consistent forecasts across the entire collection of time series. 

A game-theoretically optimal reconciliation method is proposed in [6]. The authors
address the problem in two independent steps, first computing the best possible
forecasts for the time series without taking into account the hierarchical structure and then
applying a game-theoretic reconciliation procedure to make the forecasts aggregate consistent.

The previously mentioned methods are limited by the nature of the time-series
approach they take. It is often impossible to take any advantage of additional features and
attributes accompanying data with such an approach. Furthermore, many prevalent data
imperfection problems, such as missing data, lead to imperfect time-series. This fact
reduces the applicability of time-series models that require equally distanced samples.

In our work, we take advantage of additional features and the structure of the
grouped data to improve and reconcile predictions. Instead of forecasting each time
series independently and then combining the predictions, in a regression setting, we can
reconcile future events using only some past events. This leads to a solution suitable for
online applications.


Application Area of Travel Time Prediction: A considerable number of
research papers address the problem of travel time prediction for transport
applications. Accurate travel time information is essential as it attracts more commuters and
increases commuters' satisfaction [1]. The majority of these works are on short-term
travel time prediction [19], aimed at applications in advanced traveler information
systems. There are also works on long-term travel time prediction [13], which can be used
as a planning tool for public transport companies or even for freight transport.

Link travel time prediction can be used for route guidance [17], for bus bunching
detection [14], or to predict the bus arrival time at the next station [18], which can
support related information services. More recently, Global Positioning System (GPS)
data has become increasingly available, allowing its use to predict travel times from
GPS trajectories. These trajectories can be used to construct origin-destination matrices
of travel times or traffic flows, an important tool for mobility purposes [2].

Using both link travel time predictions and the full trip travel time prediction in 
order to improve all those predictions is a contribution of this paper for the transporta- 
tion field. 


3 The R4R Method 


3.1 Problem Definition 


Consider a dataset $D = (X, L, r)$. Here $X$ denotes the set of predictive
attributes and is a matrix of size $N \times Q$ representing a set of $N$ instances,
each composed of $Q$ predictive attributes. Furthermore, $L$ is the set of the
many component targets and is a matrix of size $N \times K$, with $K$ being the number of
elements of the many component target, and $r$ represents the set of one component targets
and is a vector of length $N$. Each element $r_n \in r$ is the target attribute of the
one component, and each $l_{n,k} \in L$ is the $k$th target attribute of the many component.
Also, consider $r_n = \sum_{k=1}^{K} l_{n,k}$, denoting that the sum of all the many component targets
is equal to the corresponding one component target.

Defining the prediction of each $l_{n,k}$ as $p_{n,k}$, we are looking for a model that ensures
that the sum of the predictions of the many component target is as close as possible
to $r_n$. In other words, after making predictions, we want the following equation to
hold:

$$\sum_{k=1}^{K} p_{n,k} \approx r_n, \quad n \le N \qquad (1)$$

3.2 Methodology 


In this section, we elaborate on our proposed method, Reconciling for Regression
(R4R), to address the above-mentioned problem. The R4R method is composed of two steps.
In the first step, it learns models for the prediction of the many component targets,
separately. In the second step, it reconciles the many predictions with the one component.

In order to improve the individual $p_{n,k}$ predictions such that Eq. 1 holds, our
proposed framework uses a modified version of the least squares optimization method to
compute a set of corrective coefficients (see Eq. 4), which are used to update the individual
$p_{n,k}$ predictions.


Step 1, Learning the Predictive Models: In the first step, the predictions of the many
component targets are calculated using a specific base learning method. $K$ different
models are trained, one for each of the $K$ elements of the many target component.
It is possible to select a different learning method for each element to ensure higher
accuracy. The resulting predictions for each of the $K$ elements are referred to as $p_{m,k}$,
where $m$ is the instance number and $k$ identifies elements of the many component
targets. Algorithm 1 depicts these steps. As a result, this algorithm creates an output $P$,
a matrix of size $M \times K$ composed of predictions $p_{m,k}$. $P$ is used in the second stage
for reconciliation.


Algorithm 1. Learning the predictive model

Input: D (dataset matrix of size N × (Q + K)), Me (base learning method), y (a percentage
value)
Output: P (predictions matrix of size M × K)
1: Split dataset D into Train set of size (1 − y)N and Test set of size M = y × N;
2: for k = 1 to K do
3:   Train model_k using Me to predict the kth element of the many component target;
4:   for m = 1 to M do
5:     p_{m,k} := Predict the value of the mth instance of the kth target in Test using model_k; // p_{m,k}
       denotes elements of P
6:   end for
7: end for
8: return P;
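
The loop structure of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: `MeanLearner` is a hypothetical stand-in for the base learning method, the random split is our choice (the algorithm only says "split"), and the helper names are not from the paper.

```python
import random


class MeanLearner:
    """Stand-in base learner: always predicts the training mean of its target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def learn_predictions(X, L, make_learner, test_fraction, seed=0):
    """Return (test_indices, P), where P[m][k] is the prediction for the
    k-th many component target of the m-th test instance."""
    n, K = len(X), len(L[0])
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumption: random train/test split
    M = int(round(test_fraction * n))
    test, train = indices[:M], indices[M:]
    X_train, X_test = [X[i] for i in train], [X[i] for i in test]
    P = [[0.0] * K for _ in range(M)]
    for k in range(K):  # one model per element of the many component target
        model = make_learner().fit(X_train, [L[i][k] for i in train])
        for m, value in enumerate(model.predict(X_test)):
            P[m][k] = value
    return test, P


# Toy data: 10 instances, 2 many component targets with constant values.
X = [[float(i)] for i in range(10)]
L = [[10.0, 20.0] for _ in range(10)]
test_idx, P = learn_predictions(X, L, MeanLearner, test_fraction=0.3)
```

Any object exposing `fit` and `predict` could replace `MeanLearner`, and a different learner could be passed per target, which mirrors the remark that the learning method may differ for each element.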


Step 2, Reconciling Predictions: In the second step, the framework updates the values of the predictions produced by the initial models of Algorithm 1. This is achieved by estimating a corrective coefficient ($\theta_k$) for each element of the many target component ($p_{m,k}$). This coefficient is multiplied with the model predictions to minimize the error with respect to the actual one component target ($r_m$) and many component target ($l_{m,k}$). We achieve this goal using a least-squares method on the current training dataset, with the objective functions given by Eqs. 2 and 3, to estimate $\theta = (\theta_1, ..., \theta_K)$.


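$$\arg\min_{lb \le \theta \le ub} \Big( \sum_{k=1}^{K} (\theta_k p_{m,k}) - r_m \Big)^2, \quad m \le M \tag{2}$$

$$\arg\min_{lb \le \theta \le ub} \sum_{k=1}^{K} (\theta_k p_{m,k} - l_{m,k})^2, \quad m \le M \tag{3}$$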


The first objective function, presented in Eq. 2, optimizes reconciliation based on the value of the one component target. The second objective function, presented in Eq. 3, minimizes the error of the predictions based on the value of each element of the many component targets, separately. Both objective functions can be combined and expanded into Eq. 4. In Eq. 4, the first M rows represent the objective function of Eq. 2. The remaining M × K rows represent the second objective function, as given in Eq. 3.


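$$
\begin{bmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,k} & \cdots & p_{1,K} \\
p_{2,1} & p_{2,2} & \cdots & p_{2,k} & \cdots & p_{2,K} \\
\vdots & \vdots & & \vdots & & \vdots \\
p_{M,1} & p_{M,2} & \cdots & p_{M,k} & \cdots & p_{M,K} \\
p_{1,1} & 0 & \cdots & 0 & \cdots & 0 \\
0 & p_{1,2} & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 & \cdots & p_{1,K} \\
\vdots & \vdots & & \vdots & & \vdots \\
p_{M,1} & 0 & \cdots & 0 & \cdots & 0 \\
0 & p_{M,2} & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
0 & 0 & \cdots & p_{M,k} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & \cdots & p_{M,K}
\end{bmatrix}
\begin{bmatrix}
\theta_1 \\ \theta_2 \\ \vdots \\ \theta_k \\ \vdots \\ \theta_K
\end{bmatrix}
=
\begin{bmatrix}
r_1 \\ r_2 \\ \vdots \\ r_M \\ l_{1,1} \\ l_{1,2} \\ \vdots \\ l_{1,K} \\ \vdots \\ l_{M,1} \\ l_{M,2} \\ \vdots \\ l_{M,k} \\ \vdots \\ l_{M,K}
\end{bmatrix}
\tag{4}
$$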


As seen in Eqs. 2 and 3, we have defined a constraint on the values of θ. The aim is to regularize the modifications made to the predictions of each element of the many component targets in a sensible manner (e.g. negative factors cannot be allowed when negative predictions are not meaningful). Therefore, we assume, without loss of generality, that all values of θ are positive, with lower (lb) and upper (ub) bound constraints, $0 < lb \le \theta_k \le ub$. Both lb and ub are free input parameters. We reduce the number of free parameters to one (α) by defining a symmetric bound region as $(lb, ub) = (1 - \alpha, 1 + \alpha)$.
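The stacked system of Eq. 4 with the symmetric bounds can be sketched with SciPy's `lsq_linear`, the routine the case study names. This is an illustrative sketch only: the function name `reconcile` and the array layout are assumptions, and the nearest-neighbor selection is left out, so θ is estimated from all rows passed in.

```python
import numpy as np
from scipy.optimize import lsq_linear


def reconcile(P, r, L, alpha=0.01):
    """Estimate corrective coefficients theta from the stacked system of Eq. 4.

    P     : (M, K) array of predictions p_{m,k}
    r     : (M,) array of one component targets r_m
    L     : (M, K) array of many component targets l_{m,k}
    alpha : half-width of the bound region (lb, ub) = (1 - alpha, 1 + alpha)
    """
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    L = np.asarray(L, dtype=float)
    M, K = P.shape

    # Second block of Eq. 4: one K x K diagonal block diag(p_{m,1..K}) per instance m.
    diag_blocks = np.vstack([np.diag(P[m]) for m in range(M)])  # (M * K, K)

    A = np.vstack([P, diag_blocks])          # (M + M * K, K)
    b = np.concatenate([r, L.reshape(-1)])   # row-major flattening matches the block order

    result = lsq_linear(A, b, bounds=(1.0 - alpha, 1.0 + alpha))
    theta = result.x
    # Broadcasting scales column k of P by theta_k.
    return theta, P * theta
```

For example, if every link target is 2% above its prediction and α = 0.04, the unconstrained solution θ_k = 1.02 lies inside the bounds and is returned; with α = 0.01 the coefficients are confined to [0.99, 1.01].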

The process of reconciling the predictions is explained in Algorithm 2. In the final step of this algorithm, the prediction matrix for all elements of the many component targets is updated using the corrective coefficients θ. A least-squares method is used to calculate the corrective coefficients. To provide robustness against outliers, we suggest using a number nk of nearest neighbors for estimating θ. We assume that similar trips from the past have the same behavior, as shown in [12]. The new predictions are denoted $P_{new}$. The algorithm takes into account the information of the predictions for both the many component elements and the one component predicted from similar instances in $P_{nk \times K}$, in order to verify Eq. 1 on reconciliation.


Algorithm 2. Reconciling predictions

Input: P (predictions matrix of size M × K), nk (number of nearest neighbors), lb, ub (lower and upper bounds for the θs)
Output: P_new (new predictions matrix of size M × K), θ (vector of corrective coefficients)
1: for k = 1 to K do
2:     Get the nk nearest neighbors for each prediction;
3:     Calculate θ using the Least Squares method with bounds (lb, ub) according to Eq. 4;
4:     P_new = P · θ
5: end for
6: return P_new, θ


Table 1. Characteristics of tested STCP bus routes


Bus line | Origin - Destination        | #Stops | #Trips
L200     | Bolhão - Castelo do Queijo  | 30     | 2526
L201     | Viso - Aliados              | 26     | 2453
L305     | Cordoaria - Hospital S. João | 22     | 3126
L401     | Bolhão - S. Roque           | 26     | 4476
L502     | Bolhão - Matosinhos         | 32     | 5966
L900     | Trindade - S. Odivio        | 34     | 219


4 Case Study 


To test the methodology explained in Sect. 3.2, we conduct a series of experiments using a real dataset that has the desired hierarchical organization of target values. Measuring travel time in public transport systems can produce such a dataset. Being able to perform accurate Travel Time Prediction (TTP) is an important goal for public transport companies. On the one hand, predicting the travel time of the link between two consecutive stops (the many component targets in our model) makes it possible to inform roadside users in a timely manner about the arrival of buses at bus stops (in the rest of this paper we refer to this value as link TTP). On the other hand, predicting the total trip travel time (the one component in our model) is useful to better schedule drivers' duty services (in the rest of this paper, we refer to this value as total TTP) [4].

The dataset used in this section is provided by the Sociedade de Transportes Colectivos do Porto (STCP), the main mass public transportation company in Porto, Portugal.

The experiments described in the following sections are based on data collected from January 1st to March 30th of 2010 on six bus routes (shown in Table 1). All six selected bus routes operate from 5:30 a.m. to 2:00 a.m. However, we have considered only bus trips starting after 6 a.m.

The collected dataset has multiple nominal and ordinal attributes that make it suitable for defining a regression problem. We have selected five features that characterize each bus trip: (1) WEEKDAY: the day of the week {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday}; (2) DAYTYPE: the type of the day {holiday, normal, non-working day, weekend holiday}; (3) Bus Day Month: {1,...,31}; (4) Shift ID; (5) Travel ID.

We have implemented R4R using the R Software [15] and the lsq_linear routine from the SciPy Python library [10]. For the first stage of R4R, as depicted in Algorithm 1, we use a simple multivariate linear regression as the base learning method, which we refer to as Bas. We further split the data as follows: a 30-day window is used for selecting training samples, and a 60-day window is used for selecting test samples.

In our experiments, the parameter α, which determines the lower and upper bounds used when estimating θ, varies from 0.01 to 0.04. The largest setting, α = 0.04, corresponds to 0.96 and 1.04 as the minimum and maximum values that θ can take, respectively.


5 Comparative Study 


5.1 Can Reconciliation Be Achieved Using R4R? 


First, using the proposed R4R method, we try to answer the following question: is it possible to use the total trip travel time to improve the link TTPs while simultaneously guaranteeing a better reconciliation between the sum of the link TTPs and the total TTP? To answer this question we measure the relative performance improvement achieved by R4R compared to a multivariate linear regression as the base learning method (denoted by Bas).

We evaluate the performance on the following metrics: (i) link travel time prediction (LP), (ii) the sum of link travel time predictions (SFP), and (iii) full trip time prediction (FP). Methods are compared based on the Root Mean Square Error (RMSE) as defined in Eq. 5.


$$RMSE = \sqrt{\frac{1}{N_{test}} \sum_{i=1}^{N_{test}} (\hat{y}_i - y_i)^2} \tag{5}$$


where $y_i$ and $\hat{y}_i$ represent the target and predicted bus arrival times for the ith example in the test set, respectively, and $N_{test}$ is the total number of test samples. For the link travel time prediction indicator, LP, the mean of the RMSE of each bus link is considered.
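The metric of Eq. 5 and the LP indicator can be written out directly. The sketch below uses only the standard library; the function names and the list-of-lists layout for the per-link series are assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error of Eq. 5 over paired targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean_link_rmse(links_true, links_pred):
    """LP indicator: mean of the per-link RMSE values.

    links_true / links_pred : one sequence of test values per bus link
    """
    per_link = [rmse(t, p) for t, p in zip(links_true, links_pred)]
    return sum(per_link) / len(per_link)
```

The SUM and full-trip indicators reuse `rmse` on trip-level values, for instance the per-trip sum of link predictions against the observed full trip time.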

Results of the comparison of R4R and Bas are presented in Fig. 1. Note that relative gains are presented for the sake of readability: the duration of travel times varies widely, which makes the graphs unreadable when absolute values are plotted.

As can be seen, R4R outperforms the base multivariate regression model in all cases. This comparison answers the question posed earlier: R4R improves the predictions of the base regression learning method while simultaneously guaranteeing a better reconciliation between the sum of the link TTPs and the total trip travel time.
[Figure: Relative Improvements]

Fig. 1. Relative improvement of R4R (Res) relative to Baseline (Bas) for mean LP - Link Prediction (red), sum of the link travel time predictions (green) and the full trip time prediction (blue). (Color figure online)


5.2 How Does R4R Perform Against Baselines Made for Time Series Data? 


We continue our experiments by comparing our proposed methodology R4R with the methods proposed by Hyndman et al. in recent related work [8,9,16], denoted by H2011, W2015, and H2016. For this comparison, we used the implementation available in the R package [7]. Note that these baseline models are designed for time-series data. Therefore, in order to compare against them, we also define a time series problem using this dataset. This is achieved by representing the data as a time series with a resolution of one hour. We compute the mean link travel time for each hour from 6:00 a.m. to 2:00 a.m. the next day, i.e. 20 data points in total for each "bus day". Averaging is needed because, in the majority of cases, an interval contains more than one link travel time. Because the dataset has a considerable amount of missing values, interpolation was used to fill in the missing link travel times. However, the results presented in the paper do not take into account the predictions made for intervals with no data.
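The pre-processing described above can be sketched as follows. The paper does not state which interpolation scheme was used, so linear interpolation between the nearest observed hours is an assumption here, as are the function names and the (hour of day, travel time) input format.

```python
def hourly_means(observations, start_hour=6, n_bins=20):
    """Average link travel times into hourly bins of one "bus day".

    observations : iterable of (hour_of_day, travel_time) pairs, hour_of_day in [0, 24)
    Bin 0 covers 6:00-7:00 and bin 19 covers 1:00-2:00 of the next day.
    Returns a list of n_bins means, with None for hours without data.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for hour, travel_time in observations:
        bin_index = (int(hour) - start_hour) % 24
        if bin_index < n_bins:  # hours outside the bus day are ignored
            sums[bin_index] += travel_time
            counts[bin_index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def interpolate_gaps(series):
    """Fill None entries linearly between neighbors (assumed scheme)."""
    known = [i for i, value in enumerate(series) if value is not None]
    if not known:
        return list(series)
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            filled[i] = series[right]   # before the first observation
        elif right is None:
            filled[i] = series[left]    # after the last observation
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled
```

Keeping the `None` mask from `hourly_means` makes it possible to exclude the filled intervals from the evaluation, as is done for the reported results.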

The above-mentioned pre-processing tasks that were necessary in order to use the 
approaches proposed by Hyndman et al. already suggest that it is viable to propose 
methods such as R4R that perform in a more general and flexible regression setting. 
Indeed, the discretization of data into a time-series format implies the need to make pre- 
dictions for intervals instead of point-wise predictions as done in the regression setting. 
Discretization also implies the necessity of filling missing data when the intervals have 
no data instances. This problem can be prevented by considering larger intervals. How- 
ever, larger intervals imply loss of details. Moreover, the regression setting deals natu- 
rally with additional attributes that can partially explain the value of the target attribute. 


322 J. Mendes-Moreira and M. Baratchi 


[Figure: bar chart for Bus Line L305; x-axis: Stop Links (1 to 21, plus SUM); y-axis: RMSE; models: Bas, H2011, H2016, R4R, W2015]

Fig. 2. RMSE for each of the Link Travel Time Predictions of R4R against the methods proposed in H2011 [8], W2015 [16], H2016 [9] applied to bus route L305. SUM is the RMSE of the sum of the LTT prediction for the entire trip against the full trip time. This plot shows the results before the bus starts its journey.


Table 2. Overall mean RMSE for each model, H2011 [8], W2015 [16], H2016 [9] and the new proposed approach R4R. BL - the Bus Line, LP - mean of the RMSE of Link Predictions, STP - RMSE of the sum of the LTT prediction for the entire trip against the full trip time.


BL   | MODEL | LP     | STP    | BL   | MODEL | LP    | STP    | BL   | MODEL | LP    | STP
L200 | BAS   | 277.88 | 318.11 | L201 | BAS   | 41.56 | 354.11 | L305 | BAS   | 48.37 | 327.69
L200 | R4R   | 277.76 | 309.82 | L201 | R4R   | 41.43 | 346.52 | L305 | R4R   | 48.25 | 321.81
L200 | H2011 | 51.50  | 865.38 | L201 | H2011 | 42.01 | 321.80 | L305 | H2011 | 48.59 | 297.27
L200 | H2016 | 44.40  | 496.71 | L201 | H2016 | 41.69 | 314.09 | L305 | H2016 | 48.49 | 295.97
L200 | W2015 | 37.38  | 319.22 | L201 | W2015 | 41.64 | 308.85 | L305 | W2015 | 48.40 | 296.41
L401 | BAS   | 29.26  | 239.11 | L502 | BAS   | 42.62 | 385.34 | L900 | BAS   | 58.60 | 401.89
L401 | R4R   | 29.17  | 234.29 | L502 | R4R   | 42.50 | 375.75 | L900 | R4R   | 58.60 | 395.79
L401 | H2011 | 26.87  | 193.29 | L502 | H2011 | 47.27 | 264.14 | L900 | H2011 | 48.20 | 432.25
L401 | H2016 | 26.80  | 192.65 | L502 | H2016 | 46.69 | 270.02 | L900 | H2016 | 48.24 | 420.58
L401 | W2015 | 26.76  | 192.38 | L502 | W2015 | 46.82 | 287.71 | L900 | W2015 | 48.08 | 403.34


Figure 2 presents the results of predictions for bus route L305. We show only the results for α = 0.01, the setting that consistently gave the best performance: in all our experiments, the errors increase with increasing values of α. The results show very small differences between the methods under study.

The data provided is not homogeneous. This can adversely affect the performance of the least-squares method when outlying data is used to find the corrective coefficients θ. To avoid such problems, in our proposed framework we select the nk nearest neighbors for each bus trip (as also presented in Algorithm 2). Thus, after each link travel time prediction, it is necessary to recompute the whole process, i.e., to select a new set of similar bus trips, find the coefficients using the least-squares method, and update the predictions. Compared with the work of Hyndman et al., this process leads to a more computationally expensive solution. It is also important to find a suitable value for nk. During our experiments, we observed that the best results are achieved for nk = 3. Therefore, all results presented in this paper are based on nk = 3.

Table 2 shows the general results of predictions using this approach for all bus routes tested, using multivariate linear regression as the base learning method (Bas). The results show that R4R outperforms Bas in all cases. There are a number of cases where a version of the time series model proposed by Hyndman et al. performs better than R4R. These differences can be explained by the simple linear regression algorithm we used as a base learner in Algorithm 1: a linear model cannot find non-linear relations between features. The performance of R4R can be improved further, as it allows using any other regression method. Furthermore, using extra features, such as weather conditions, could possibly improve the performance of R4R even further. The methods proposed by Hyndman et al., however, cannot benefit from extra features.


6 Conclusion 


In this paper, we study the problem of the reconciliation of predictions in a regression setting. We present a two-stage framework for prediction and reconciliation. In order to evaluate the performance and applicability of this method, we conduct a set of experiments using a real dataset collected from buses in Porto, Portugal. The results demonstrate that R4R improves the predictions of the base learning method. R4R is also able to further improve the reconciliation of the link TTPs after each iteration in an online manner. However, this is not shown due to space constraints. We also compare the results achieved in a regression setting with those of a time-series approach. In the case study discussed in this paper, R4R is able to reduce the error of link TTPs and increase reconciliation. An important advantage of the R4R method compared to time series variants is that it provides a flexible framework that can take advantage of any regression model and of additional features accompanying the data. Furthermore, R4R is not affected by data imperfection problems such as missing data, which reduce the applicability of time-series models that require equally spaced samples.
Abstract. Manifold regularization is a commonly used technique in
semi-supervised learning. It enforces the classification rule to be smooth 
with respect to the data-manifold. Here, we derive sample complexity 
bounds based on pseudo-dimension for models that add a convex data 
dependent regularization term to a supervised learning process, as is in 
particular done in Manifold regularization. We then compare the bound 
for those semi-supervised methods to purely supervised methods, and 
discuss a setting in which the semi-supervised method can only have a 
constant improvement, ignoring logarithmic terms. By viewing Manifold 
regularization as a kernel method we then derive Rademacher bounds 
which allow for a distribution dependent analysis. Finally we illustrate 
that these bounds may be useful for choosing an appropriate manifold 
regularization parameter in situations with very sparsely labeled data. 


Keywords: Semi-supervised learning · Learning theory · Manifold regularization


1 Introduction 


In many applications, as for example image or text classification, gathering unla- 
beled data is easier than gathering labeled data. Semi-supervised methods try 
to extract information from the unlabeled data to get improved classification 
results over purely supervised methods. A well-known technique to incorporate 
unlabeled data into a learning process is manifold regularization (MR) [7,18]. 
This procedure adds a data-dependent penalty term to the loss function that penalizes classification rules that behave non-smoothly with respect to the data distribution. This paper presents a sample complexity and a Rademacher complexity analysis for this procedure. In addition it illustrates how our Rademacher complexity bounds may be used for choosing a suitable manifold regularization parameter.

We organize this paper as follows. In Sects. 2 and 3 we discuss related work and introduce the semi-supervised setting. In Sect. 4 we formalize the idea of adding a distribution-dependent penalty term to a loss function. Algorithms such as manifold, entropy or co-regularization [7,14,21] follow this idea. Section 5 generalizes a bound from [4] to derive sample complexity bounds for the proposed framework, and thus in particular for MR. For the specific case of regression, we furthermore adapt a sample complexity bound from [1], which is essentially tighter than the first bound, to the semi-supervised case. In the same section we sketch a setting in which we show that if our hypothesis set has finite pseudo-dimension, and we ignore logarithmic factors, any semi-supervised learner (SSL) that falls in our framework has at most a constant improvement in terms of sample complexity. In Sect. 6 we show how one can obtain distribution dependent complexity bounds for MR. We review a kernel formulation of MR [20] and show how this can be used to estimate Rademacher complexities for specific datasets. In Sect. 7 we illustrate on an artificial dataset how the distribution dependent bounds could be used for choosing the regularization parameter of MR. This is particularly useful as the analysis does not need an additional labeled validation set. The practicality of this approach requires further empirical investigation. In Sect. 8 we discuss our results and speculate about possible extensions.


2 Related Work 


In [13] we find an investigation of a setting where distributions on the input space $\mathcal{X}$ are restricted to ones that correspond to unions of irreducible algebraic sets of a fixed size $k \in \mathbb{N}$, and each algebraic set is either labeled 0 or 1. A SSL that knows the true distribution on $\mathcal{X}$ can identify the algebraic sets and reduce the hypothesis space to all $2^k$ possible label combinations on those sets. As we are left with finitely many hypotheses we can learn them efficiently, while they show that every supervised learner is left with a hypothesis space of infinite VC dimension.

The work in [18] considers manifolds that arise as embeddings from a circle, 
where the labeling over the circle is (up to the decision boundary) smooth. 
They then show that a learner that has knowledge of the manifold can learn 
efficiently while for every fully supervised learner one can find an embedding 
and a distribution for which this is not possible. 

The relation to our paper is as follows. They provide specific examples where the gap in sample complexity between a semi-supervised and a supervised learner is infinitely large, while we explore general sample complexity bounds of MR and sketch a setting in which MR cannot essentially improve over supervised methods.


3 The Semi-supervised Setting 


We work in the statistical learning framework: we assume we are given a feature domain $\mathcal{X}$ and an output space $\mathcal{Y}$ together with an unknown probability distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. In binary classification we usually have that $\mathcal{Y} = \{-1, 1\}$, while for regression $\mathcal{Y} = \mathbb{R}$. We use a loss function $\phi : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$, which is convex in the first argument and in practice usually a surrogate for the 0-1 loss in classification, and the squared loss in regression tasks. A hypothesis $f$ is a function $f : \mathcal{X} \to \mathbb{R}$. We set $(X, Y)$ to be a random variable distributed according to $P$, while small $x$ and $y$ are elements of $\mathcal{X}$ and $\mathcal{Y}$ respectively. Our goal is to find a hypothesis $f$, within a restricted class $\mathcal{F}$, such that the expected loss $Q(f) := E[\phi(f(X), Y)]$ is small. In the standard supervised setting we choose a hypothesis $f$ based on an i.i.d. sample $S_n = \{(x_i, y_i)\}_{i \in \{1,...,n\}}$ drawn from $P$. With that we define the empirical risk of a model $f \in \mathcal{F}$ with respect to $\phi$ and measured on the sample $S_n$ as $\hat{Q}(f, S_n) = \frac{1}{n}\sum_{i=1}^{n} \phi(f(x_i), y_i)$. For ease of notation we sometimes omit $S_n$ and just write $\hat{Q}(f)$. Given a learning problem defined by $(P, \mathcal{F}, \phi)$ and a labeled sample $S_n$, one way to choose a hypothesis is by the empirical risk minimization principle

$$f_{\text{sup}} = \arg\min_{f \in \mathcal{F}} \hat{Q}(f, S_n). \tag{1}$$


We refer to $f_{\text{sup}}$ as the supervised solution. In SSL we additionally have samples with unknown labels. So we assume to have $n + m$ samples $(x_i, y_i)_{i \in \{1,...,n+m\}}$ independently drawn according to $P$, where $y_i$ has not been observed for the last $m$ samples. We furthermore set $U = \{x_1, ..., x_{n+m}\}$, so $U$ is the set that contains all our available information about the feature distribution.

Finally we denote by $m^{L}(\epsilon, \delta)$ the sample complexity of an algorithm $L$. That means that for all $n \ge m^{L}(\epsilon, \delta)$ and all possible distributions $P$ the following holds. If $L$ outputs a hypothesis $f_L$ after seeing an n-sample, we have with probability of at least $1 - \delta$ over the n-sample $S_n$ that $Q(f_L) - \min_{f \in \mathcal{F}} Q(f) \le \epsilon$.


4 A Framework for Semi-supervised Learning 


We follow the work of [4] and introduce a second convex loss function $\psi : \mathcal{F} \times \mathcal{X} \to \mathbb{R}_+$ that only depends on the input feature and a hypothesis. We refer to $\psi$ as the unsupervised loss as it does not depend on any labels. We propose to add the unlabeled data through the loss function $\psi$ and add it as a penalty term to the supervised loss to obtain the semi-supervised solution

$$f_{\text{semi}} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \phi(f(x_i), y_i) + \lambda \frac{1}{n+m}\sum_{j=1}^{n+m} \psi(f, x_j), \tag{2}$$


where $\lambda > 0$ controls the trade-off between the supervised and the unsupervised loss. This is in contrast to [4], as they use the unsupervised loss to restrict the hypothesis space directly. In the following section we recall the important insight that those two formulations are equivalent in some scenarios, so that we can use [4] to generate sample complexity bounds for the SSL framework presented here.
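The penalized objective of Equation (2) can be made concrete with a small sketch. Everything specific in it is an assumption chosen for illustration: the one-parameter hypothesis class `Linear`, the toy unsupervised loss `toy_unsupervised_loss` (which is not manifold regularization), and a finite grid search standing in for the arg min.

```python
class Linear:
    """Hypothesis f_w(x) = w * x on scalar inputs (illustrative class)."""

    def __init__(self, w):
        self.w = w

    def __call__(self, x):
        return self.w * x


def squared_loss(prediction, y):
    """Supervised loss phi, convex in its first argument."""
    return (prediction - y) ** 2


def toy_unsupervised_loss(f, x):
    """Hypothetical psi: label-free and convex in f; penalizes large outputs."""
    return f(x) ** 2


def semi_supervised_objective(f, labeled, unlabeled, phi, psi, lam):
    """Value of the objective in Eq. (2) for one hypothesis f."""
    n = len(labeled)
    # U contains the inputs of the labeled and of the unlabeled points.
    inputs = [x for x, _ in labeled] + list(unlabeled)
    supervised = sum(phi(f(x), y) for x, y in labeled) / n
    unsupervised = sum(psi(f, x) for x in inputs) / len(inputs)
    return supervised + lam * unsupervised


def fit_semi(hypotheses, labeled, unlabeled, phi, psi, lam):
    """Pick the minimizer of Eq. (2) over a finite set of hypotheses."""
    return min(
        hypotheses,
        key=lambda f: semi_supervised_objective(f, labeled, unlabeled, phi, psi, lam),
    )
```

With one labeled point (1, 2) and one unlabeled input 1, the objective for f_w is (w - 2)^2 + λ w^2, so λ = 0 recovers the supervised solution w = 2 and λ = 1 pulls the solution to w = 1.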

For ease of notation we set $\hat{R}(f, U) = \frac{1}{n+m}\sum_{j=1}^{n+m} \psi(f, x_j)$ and $R(f) = E[\psi(f, X)]$. We do not claim any novelty for the idea of adding an unsupervised loss for regularization. A different framework can be found in [11, Chapter 10]. We are, however, not aware of a deeper analysis of this particular formulation, as done for example by the sample complexity analysis in this paper. As we are in particular interested in the class of MR schemes we first show that this method fits our framework.
Example: Manifold Regularization. Overloading the notation we now write $P(X)$ for the distribution $P$ restricted to $\mathcal{X}$. In MR one assumes that the input distribution $P(X)$ has support on a compact manifold $M \subset \mathcal{X}$ and that the predictor $f \in \mathcal{F}$ varies smoothly in the geometry of $M$ [7]. There are several regularization terms that can enforce this smoothness, one of which is $\int_M \|\nabla_M f(x)\|^2 dP(x)$, where $\nabla_M f$ is the gradient of $f$ along $M$. We know that $\int_M \|\nabla_M f(x)\|^2 dP(x)$ may be approximated with a finite sample of $\mathcal{X}$ drawn from $P(X)$ [6]. Given such a sample $U = \{x_1, ..., x_{n+m}\}$ one first defines a weight matrix $W$, where $W_{ij} = e^{-\|x_i - x_j\|^2/\sigma}$. We then set $L$ as the Laplacian matrix $L = D - W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{n+m} W_{ij}$. Let furthermore $f_U = (f(x_1), ..., f(x_{n+m}))^t$ be the evaluation vector of $f$ on $U$. The expression $\frac{1}{(n+m)^2} f_U^t L f_U = \frac{1}{2(n+m)^2}\sum_{i,j} (f(x_i) - f(x_j))^2 W_{ij}$ converges to $\int_M \|\nabla_M f\|^2 dP(x)$ under certain conditions [6]. This motivates us to set the unsupervised loss as $\psi(f, (x_i, x_j)) = (f(x_i) - f(x_j))^2 W_{ij}$. Note that $f_U^t L f_U$ is indeed a convex function in $f$: as $L$ is a Laplacian matrix it is positive semi-definite, and thus $\sqrt{f_U^t L f_U}$ defines a semi-norm in $f$. Convexity then follows from the triangle inequality, since squaring is convex and non-decreasing on the non-negative reals.
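The quantities in this example can be checked numerically. The sketch below builds the weight matrix and the Laplacian with NumPy and compares the quadratic form with the pairwise sum; it relies on the standard identity f^t L f = (1/2) Σ_{i,j} W_ij (f_i - f_j)^2. The function names and the bandwidth parameter `sigma` are assumptions for illustration.

```python
import numpy as np


def gaussian_weights(X, sigma=1.0):
    """Weight matrix with W_ij = exp(-||x_i - x_j||^2 / sigma)."""
    X = np.asarray(X, dtype=float)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma)


def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal matrix of row sums of W."""
    return np.diag(W.sum(axis=1)) - W


def manifold_penalty(f_U, W):
    """Quadratic form f_U^t L f_U, normalized by (n + m)^2."""
    f_U = np.asarray(f_U, dtype=float)
    return f_U @ laplacian(W) @ f_U / len(f_U) ** 2


def pairwise_penalty(f_U, W):
    """The same quantity written as a sum over pairs of sample points."""
    f_U = np.asarray(f_U, dtype=float)
    diff = f_U[:, None] - f_U[None, :]
    return 0.5 * np.sum(W * diff ** 2) / len(f_U) ** 2
```

Because the rows of L sum to zero, a constant evaluation vector has zero penalty, which is why the quadratic form is positive semi-definite rather than strictly positive definite.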


5 Analysis of the Framework 


In this section we analyze the properties of the solution $f_{\text{semi}}$ found in Equation
(2). We derive sample complexity bounds for this procedure, using results from
[4], and compare them to sample complexities for the supervised case. In [4] 
the unsupervised loss is used to restrict the hypothesis space directly, while we 
use it as a regularization term in the empirical risk minimization as usually 
done in practice. To switch between the views of a constrained optimization 
formulation and our formulation (2) we use the following classical result from 
convex optimization [15, Theorem 1]. 


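Lemma 1. Let $\phi(f(x), y)$ and $\psi(f, x)$ be functions convex in $f$ for all $x, y$. Then the following two optimization problems are equivalent:

$$\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \phi(f(x_i), y_i) + \lambda \frac{1}{n+m}\sum_{i=1}^{n+m} \psi(f, x_i) \tag{3}$$

$$\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \phi(f(x_i), y_i) \quad \text{subject to} \quad \sum_{i=1}^{n+m} \frac{1}{n+m} \psi(f, x_i) \le \tau \tag{4}$$

Where equivalence means that for each $\lambda$ we can find a $\tau$ such that both problems have the same solution and vice versa.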


For our later results we need the conditions of this lemma to hold, which we believe is not a strong restriction. In our sample complexity analysis we stick as closely as possible to the actual formulation and implementation of MR, which is usually a convex optimization problem. We first turn to our sample complexity bounds.
5.1 Sample Complexity Bounds


Sample complexity bounds for supervised learning typically use a notion of complexity of the hypothesis space to bound the worst case difference between the estimated and the true risk. As our hypothesis class allows for real-valued functions, we will use the notion of pseudo-dimension $\text{Pdim}(\mathcal{F}, \phi)$, an extension of the VC-dimension to real valued loss functions $\phi$ and hypothesis classes $\mathcal{F}$ [17,22]. Informally speaking, the pseudo-dimension is the VC-dimension of the set of functions that arise when we threshold real-valued functions to define binary functions. Note that sometimes the pseudo-dimension will have as input the loss function, and sometimes not. This is because some results use the concatenation of loss function and hypotheses to determine the capacity, while others only use the hypothesis class. This lets us state our first main result, which is a generalization of [4, Theorem 10] to bounded loss functions and real valued function spaces.


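Theorem 1. Let $\mathcal{F}^{\psi}_{\tau} := \{f \in \mathcal{F} \mid E[\psi(f, x)] \le \tau\}$. Assume that $\phi, \psi$ are measurable loss functions such that there exist constants $B_1, B_2 > 0$ with $\psi(f, x) \le B_1$ and $\phi(f(x), y) \le B_2$ for all $x, y$ and $f \in \mathcal{F}$ and let $P$ be a distribution. Furthermore let $f^*_{\tau} = \arg\min_{f \in \mathcal{F}^{\psi}_{\tau}} Q(f)$. Then an unlabeled sample $U$ of size

$$m \ge \frac{8 B_1^2}{\epsilon^2}\left[\ln\frac{16}{\delta} + 2\,\text{Pdim}(\mathcal{F}, \psi)\ln\frac{4 B_1}{\epsilon} + 1\right] \tag{5}$$

and a labeled sample $S_n$ of size

$$n \ge \max\left(\frac{8 B_2^2}{\epsilon^2}\left[\ln\frac{8}{\delta} + 2\,\text{Pdim}(\mathcal{F}^{\psi}_{\tau + \frac{\epsilon}{2}}, \phi)\ln\frac{4 B_2}{\epsilon} + 1\right], \frac{h}{4}\right) \tag{6}$$

is sufficient to ensure that with probability at least $1 - \delta$ the classifier $g \in \mathcal{F}$ that minimizes $\hat{Q}(\cdot, S_n)$ subject to $\hat{R}(\cdot, U) \le \tau + \frac{\epsilon}{2}$ satisfies

$$Q(g) \le Q(f^*_{\tau}) + \epsilon. \tag{7}$$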


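Sketch Proof: The idea is to combine three partial results with a union bound. For the first part we use Theorem 5.1 from [22] with $h = \text{Pdim}(\mathcal{F}, \psi)$ to show that an unlabeled sample size of

$$m \ge \frac{8 B_1^2}{\epsilon^2}\left[\ln\frac{16}{\delta} + 2 h \ln\frac{4 B_1}{\epsilon} + 1\right] \tag{8}$$

is sufficient to guarantee $\hat{R}(f) - R(f) < \frac{\epsilon}{2}$ for all $f \in \mathcal{F}$ with probability at least $1 - \frac{\delta}{4}$. In particular choosing $f = f^*_{\tau}$ and noting that by definition $R(f^*_{\tau}) \le \tau$ we conclude that with the same probability

$$\hat{R}(f^*_{\tau}) \le \tau + \frac{\epsilon}{2}. \tag{9}$$

For the second part we use Hoeffding's inequality to show that the labeled sample size is big enough that with probability at least $1 - \frac{\delta}{4}$ it holds that

$$\hat{Q}(f^*_{\tau}) \le Q(f^*_{\tau}) + B_2\sqrt{\frac{\ln(\frac{4}{\delta})}{2n}}. \tag{10}$$

The third part again uses Th. 5.1 from [22] with $h = \text{Pdim}(\mathcal{F}^{\psi}_{\tau + \frac{\epsilon}{2}}, \phi)$ to show that $n \ge \frac{8 B_2^2}{\epsilon^2}\left[\ln\frac{8}{\delta} + 2 h \ln\frac{4 B_2}{\epsilon} + 1\right]$ is sufficient to guarantee $Q(f) \le \hat{Q}(f) + \frac{\epsilon}{2}$ with probability at least $1 - \frac{\delta}{2}$.

Putting everything together with the union bound we get that with probability $1 - \delta$ the classifier $g$ that minimizes $\hat{Q}(\cdot, X, Y)$ subject to $\hat{R}(\cdot, U) \le \tau + \frac{\epsilon}{2}$ satisfies

$$Q(g) \le Q(f^*_{\tau}) + \frac{\epsilon}{2} + B_2\sqrt{\frac{\ln(\frac{4}{\delta})}{2n}}. \tag{11}$$

Finally the labeled sample size is big enough to bound the last rhs term by $\frac{\epsilon}{2}$.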

The next subsection uses this theorem to derive sample complexity bounds for MR. First, however, a remark about the assumption that the loss function $\phi$ is globally bounded. If we assume that $\mathcal{F}$ is a reproducing kernel Hilbert space there exists an $M > 0$ such that for all $f \in \mathcal{F}$ and $x \in \mathcal{X}$ it holds that $|f(x)| \le M \|f\|_{\mathcal{F}}$. If we restrict the norm of $f$ by introducing a regularization term with respect to the norm $\|\cdot\|_{\mathcal{F}}$, we know that the image of $\mathcal{F}$ is globally bounded. If the image is also closed it will be compact, and thus $\phi$ will be globally bounded in many cases, as most loss functions are continuous. This can also be seen as a justification to use an intrinsic regularization for the norm of $f$ in addition to the regularization by the unsupervised loss, as only then the guarantees of Theorem 1 apply. Using this bound together with Lemma 1 we can state the following corollary to give a PAC-style guarantee for our proposed framework.


Corollary 1. Let $\phi$ and $\psi$ be convex supervised and unsupervised loss functions that fulfill the assumptions of Theorem 1. Then $f_{\text{semi}}$ (2) satisfies the guarantees given in Theorem 1, when we substitute it for $g$ in Inequality (7).


Recall that in the MR setting $\hat{R}(f) = \frac{1}{(n+m)^2}\sum_{i,j=1}^{n+m} W_{ij}(f(x_i) - f(x_j))^2$. So we gather unlabeled samples from $\mathcal{X} \times \mathcal{X}$ instead of $\mathcal{X}$. Collecting $m$ samples from $\mathcal{X}$ equates to $m^2 - 1$ samples from $\mathcal{X} \times \mathcal{X}$ and thus we only need $\sqrt{m}$ instead of $m$ unlabeled samples for the same bound.


5.2 Comparison to the Supervised Solution 


In the SSL community it is well-known that using SSL does not come without a risk [11, Chapter 4]. Thus it is of particular interest how those methods compare to purely supervised schemes. There are, however, many potential supervised methods we can think of. In many works this problem is avoided by comparing to all possible supervised schemes [8,12,13]. The framework introduced in this paper allows for a more fine-grained analysis as the semi-supervision happens on top of an already existing supervised method. Thus, for our framework, it is natural to compare the sample complexity of $f_{\text{sup}}$ with the sample complexity of $f_{\text{semi}}$. To compare the supervised and semi-supervised solution we will restrict ourselves to the square loss. This allows us to draw from [1, Chapter 20], where one can find lower and upper sample complexity bounds for the regression setting. The main insight from [1, Chapter 20] is that the sample complexity depends in this setting on whether the hypothesis class is (closure) convex or not. As we anyway need convexity of the space, which is stronger than closure convexity, to use Lemma 1, we can adapt Theorem 20.7 from [1] to our semi-supervised setting.


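Theorem 2. Assume that $\mathcal{F}^{\psi}_{\tau}$ is a closure convex class with functions mapping
to $[0,1]$, that $\psi(f, x) \le B_1$ for all $x \in \mathcal{X}$ and $f \in \mathcal{F}$, and that
$\phi(f(x), y) = (f(x) - y)^2$. Assume further that there is a $B_2 > 0$ such that
$(f(x) - y)^2 \le B_2$ almost surely for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and
$f \in \mathcal{F}^{\psi}_{\tau}$. Then an unlabeled sample size of

$$m \ge \frac{2 B_1^2}{\epsilon^2} \left[ \ln\frac{8}{\delta} + 2\,\mathrm{Pdim}(\mathcal{F}, \psi) \ln\frac{2 B_1}{\epsilon} + 2 \right] \qquad (12)$$

and a labeled sample size of

$$n \ge O\left( \frac{B_2^2}{\epsilon} \left( \mathrm{Pdim}(\mathcal{F}^{\psi}_{\tau}) \ln\frac{\sqrt{B_2}}{\epsilon} + \ln\frac{2}{\delta} \right) \right) \qquad (13)$$

is sufficient to guarantee that with probability at least $1 - \delta$ the classifier $g$ that
minimizes $\hat{Q}(\cdot)$ subject to $\hat{R}(f) \le \tau + \frac{\epsilon}{2}$ satisfies

$$Q(g) \le \min_{f \in \mathcal{F}^{\psi}_{\tau}} Q(f) + \epsilon. \qquad (14)$$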


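Proof: As in the proof of Theorem 1, the unlabeled sample size is sufficient to
guarantee with probability at least $1 - \frac{\delta}{2}$ that $\hat{R}(f^*_{\tau}) \le \tau + \frac{\epsilon}{2}$. The labeled sample
size is big enough to guarantee with probability at least $1 - \frac{\delta}{2}$ that
$Q(g) \le Q(f^*_{\tau + \frac{\epsilon}{2}}) + \epsilon$ [1, Theorem 20.7]. Using the union bound we have with
probability at least $1 - \delta$ that $Q(g) \le Q(f^*_{\tau + \frac{\epsilon}{2}}) + \epsilon \le Q(f^*_{\tau}) + \epsilon$.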

Note that the previous theorem of course implies the same learning rate in 
the supervised case, as the only difference will be the pseudo-dimension term. 
As in specific scenarios this is also the best possible learning rate, we obtain the 
following negative result for SSL. 


Corollary 2. Assume that $\phi$ is the square loss, $\mathcal{F}$ maps to the interval $[0,1]$
and $\mathcal{Y} = [1 - B, B]$ for a $B \ge 2$. If $\mathcal{F}$ and $\mathcal{F}^{\psi}_{\tau}$ are both closure convex, then
for sufficiently small $\epsilon, \delta > 0$ it holds that $m^{sup}(\epsilon, \delta) = \tilde{O}(m^{semi}(\epsilon, \delta))$, where
$\tilde{O}$ suppresses logarithmic factors, and $m^{semi}$, $m^{sup}$ denote the sample complexity
of the semi-supervised and the supervised learner respectively. In other words,
the semi-supervised method can improve the learning rate by at most a constant
which may depend on the pseudo-dimensions, ignoring logarithmic factors. Note
that this holds in particular for the manifold regularization algorithm.


1 In the remarks after Theorem 1 we argue that in many cases |f(x)| is bounded, and
in those cases we can always map to [0,1] by re-scaling.


Proof: The assumptions made in the corollary allow us to invoke Equation (19.5)
from [1], which states that $m^{semi} = \Omega(\frac{1}{\epsilon} + \mathrm{Pdim}(\mathcal{F}))$.² Using Inequality (13)
as an upper bound for the supervised method and comparing this to Eq. (19.5)
from [1], we observe that all differences are either constant or logarithmic in $\epsilon$
and $\delta$.


5.3 The Limits of Manifold Regularization 


We now relate our result to the conjectures published in [19]: an SSL method cannot
learn faster by more than a constant (which may depend on the hypothesis class $\mathcal{F}$ and
the loss $\phi$) than the supervised learner. Theorem 1 from [12] showed that this
conjecture is true up to a logarithmic factor, much like our result, for classes with
finite VC-dimension and SSL methods that do not make any distributional assumptions.
Corollary 2 shows that this statement also holds in some scenarios for all SSL
methods that fall in our proposed framework. This is somewhat surprising, as our
result holds explicitly for SSL methods that do make assumptions about the
distribution: MR assumes the labeling function behaves smoothly w.r.t. the
underlying manifold.


6 Rademacher Complexity of Manifold Regularization 


In order to find out in which scenarios semi-supervised learning can help, it is
useful to also look at distribution-dependent complexity measures. For this we
derive computationally feasible upper and lower bounds on the Rademacher
complexity of MR. We first review the work of [20]: they create a kernel such that
the inner product in the corresponding reproducing kernel Hilbert space (RKHS)
automatically contains the regularization term from MR. Having this kernel, we can
use standard upper and lower bounds of the Rademacher complexity for RKHS, as found
for example in [10]. The analysis is thus similar to [21], which considers a
co-regularization setting. In particular, [20, p. 1] show the following theorem,
stated here informally.


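Theorem 3 ([20, Propositions 2.1, 2.2]). Let $H$ be an RKHS with inner product
$\langle \cdot, \cdot \rangle_H$. Let $U = \{x_1, \dots, x_{n+m}\}$, $f, g \in H$ and $f_U = (f(x_1), \dots, f(x_{n+m}))^T$.
Furthermore let $\langle \cdot, \cdot \rangle_{\mathbb{R}^{n+m}}$ be any inner product in $\mathbb{R}^{n+m}$. Let $\tilde{H}$ be the same space of
functions as $H$, but with a newly defined inner product
$\langle f, g \rangle_{\tilde{H}} = \langle f, g \rangle_H + \langle f_U, g_U \rangle_{\mathbb{R}^{n+m}}$. Then $\tilde{H}$ is an RKHS.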


2 Note that the original formulation is in terms of the fat-shattering dimension, but
this is always bounded by the pseudo-dimension.

Assume now that $L$ is a positive definite $(n+m)$-dimensional matrix and we set
the inner product $\langle f_U, g_U \rangle_{\mathbb{R}^{n+m}} = f_U^T L g_U$. By setting $L$ as the Laplacian matrix
(Sect. 4) we note that the norm of $\tilde{H}$ automatically regularizes w.r.t. the data
manifold given by $\{x_1, \dots, x_{n+m}\}$. We furthermore know the exact form of the
kernel of $\tilde{H}$.


Theorem 4 ([20, Proposition 2.2]). Let $k(x, y)$ be the kernel of $H$, $K$ be
the Gram matrix given by $K_{ij} = k(x_i, x_j)$ and $k_x = (k(x_1, x), \dots, k(x_{n+m}, x))^T$.
Finally let $I$ be the $(n+m)$-dimensional identity matrix. The kernel of $\tilde{H}$ is then
given by $\tilde{k}(x, y) = k(x, y) - k_x^T (I + LK)^{-1} L k_y$.


This interpretation of MR is useful to derive computationally feasible upper and
lower bounds of the empirical Rademacher complexity, giving distribution-dependent
complexity bounds. With $\sigma = (\sigma_1, \dots, \sigma_n)$ i.i.d. Rademacher random
variables (i.e. $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$), recall that the empirical Rademacher
complexity of the hypothesis class $H$, measured on the labeled input
features $\{x_1, \dots, x_n\}$, is defined as

$$\mathrm{Rad}_n(H) = \frac{1}{n} \mathbb{E}_{\sigma} \sup_{f \in H} \sum_{i=1}^{n} \sigma_i f(x_i). \qquad (15)$$


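Theorem 5 ([10, p. 333]). Let $H$ be an RKHS with kernel $k$ and
$H_r = \{f \in H \mid \|f\|_H \le r\}$. Given an $n$-sample $\{x_1, \dots, x_n\}$ we can bound the empirical
Rademacher complexity of $H_r$ by

$$\frac{r}{n\sqrt{2}} \sqrt{\sum_{i=1}^{n} k(x_i, x_i)} \le \mathrm{Rad}_n(H_r) \le \frac{r}{n} \sqrt{\sum_{i=1}^{n} k(x_i, x_i)}. \qquad (16)$$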


The previous two theorems lead to upper bounds on the complexity of MR, in 
particular we can bound the maximal reduction over supervised learning. 


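Corollary 3. Let $H$ be an RKHS and for $f, g \in H$ define the inner product
$\langle f, g \rangle_{\tilde{H}} = \langle f, g \rangle_H + f_U^T (\mu L) g_U$, where $L$ is a positive definite matrix and $\mu \in \mathbb{R}$
is a regularization parameter. Let $\tilde{H}_r$ be defined as before, then

$$\mathrm{Rad}_n(\tilde{H}_r) \le \frac{r}{n} \sqrt{\sum_{i=1}^{n} \left( k(x_i, x_i) - k_{x_i}^T \left( \tfrac{1}{\mu} I + LK \right)^{-1} L k_{x_i} \right)}. \qquad (17)$$

Similarly we can obtain a lower bound in line with Inequality (16).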
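
The step from Theorem 4 to Inequality (17) is short enough to spell out. The following derivation is our own expansion, assuming $\mu > 0$; it substitutes $\mu L$ for $L$ in the kernel of Theorem 4 and pulls the factor $\mu$ into the inverse. Plugging the resulting kernel into the upper bound of Theorem 5 then gives (17).

```latex
\begin{align*}
\tilde{k}_{\mu}(x, y)
  &= k(x, y) - k_x^T \left( I + \mu L K \right)^{-1} \mu L \, k_y \\
  &= k(x, y) - k_x^T \left( \mu \left( \tfrac{1}{\mu} I + L K \right) \right)^{-1} \mu L \, k_y \\
  &= k(x, y) - k_x^T \left( \tfrac{1}{\mu} I + L K \right)^{-1} L \, k_y .
\end{align*}
```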


The corollary shows in particular that the difference between the Rademacher
complexity of the supervised and the semi-supervised method is given by the term
$k_{x_i}^T (\frac{1}{\mu} I_{n+m} + LK)^{-1} L k_{x_i}$. This can be used for example to compute
generalization bounds [17, Chapter 3]. We can also use the kernel to compute local
Rademacher complexities, which may yield tighter generalization bounds [5]. Here
we illustrate the use of our bounds for choosing the regularization parameter $\mu$
without the need for an additional labeled validation set.


7 Experiment: Concentric Circles 


We illustrate the use of Eq. (17) for model selection. In particular, it can be
used to get an initial idea of how to choose the regularization parameter $\mu$. The
idea is to plot the Rademacher complexity versus the parameter $\mu$ as in Fig. 1.
We propose to use a heuristic which is often used in clustering, the so-called
elbow criterion [9]. We essentially want to find a $\mu$ such that increasing it further
no longer reduces the complexity by much. We test this idea on a
dataset which consists of two concentric circles with 500 datapoints in $\mathbb{R}^2$,
250 per circle, see also Fig. 2. We use a Gaussian base kernel with bandwidth set
to 0.5. The MR matrix $L$ is the Laplacian matrix, where weights are computed
with a Gaussian kernel with bandwidth 0.2. Note that those parameters have
to be carefully set in order to capture the structure of the dataset, but this is
not the current concern: we assume we already found a reasonable choice for
those parameters. We add a small L2-regularization that ensures that the radius
$r$ in Inequality (17) is finite. The precise value of $r$ plays a secondary role, as the
behavior of the curve from Fig. 1 remains the same.

Looking at Fig. 1 we observe that for $\mu$ smaller than 0.1 the curve still drops
steeply, while after 0.2 it starts to flatten out. We thus plot the resulting kernels
for $\mu = 0.02$ and $\mu = 0.2$ in Fig. 2. We plot the isolines of the kernel around the
point of class one, the red dot in the figure. We indeed observe that for $\mu = 0.02$
we do not capture much structure yet, while for $\mu = 0.2$ the two concentric
circles are almost completely separated by the kernel. Whether this procedure
indeed amounts to a practical method needs further empirical testing.
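
The curve described above can be approximated with a few lines of NumPy. This is a minimal sketch under several assumptions that the text does not fix: the Gaussian kernel is parametrized as exp(-d^2 / (2 * bandwidth^2)), the circles have radii 1 and 2 with a small amount of noise, the Laplacian is the unnormalized graph Laplacian, the bound is evaluated on all 500 points, and r is set to 1 without the additional L2-regularization. All function names are ours.

```python
import numpy as np

def gaussian_gram(X, bandwidth):
    """Gram matrix of exp(-||x - y||^2 / (2 * bandwidth^2)) (assumed parametrization)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def concentric_circles(n_per_circle=250, radii=(1.0, 2.0), noise=0.05, seed=0):
    """Two noisy circles in R^2; radii and noise level are illustrative guesses."""
    rng = np.random.default_rng(seed)
    parts = []
    for radius in radii:
        theta = rng.uniform(0.0, 2.0 * np.pi, n_per_circle)
        pts = radius * np.column_stack([np.cos(theta), np.sin(theta)])
        parts.append(pts + noise * rng.standard_normal(pts.shape))
    return np.vstack(parts)

def rademacher_upper_bound(K, L, mu, r=1.0):
    """Right-hand side of (17), evaluated on all sample points."""
    n = K.shape[0]
    # Diagonal of the deformed Gram matrix K - K (I/mu + L K)^{-1} L K.
    correction = K @ np.linalg.solve(np.eye(n) / mu + L @ K, L @ K)
    diag = np.clip(np.diag(K - correction), 0.0, None)  # guard against round-off
    return r / n * np.sqrt(diag.sum())

X = concentric_circles()
K = gaussian_gram(X, bandwidth=0.5)   # base kernel
W = gaussian_gram(X, bandwidth=0.2)   # graph weights
L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
mus = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
bounds = [rademacher_upper_bound(K, L, mu) for mu in mus]
for mu, bound in zip(mus, bounds):
    print(f"mu = {mu:<5} bound = {bound:.4f}")
```

Plotting `bounds` against `mus` gives a curve on which an elbow can be looked for by eye. The absolute values depend on r and on the assumptions listed above, so only the shape of the curve is meaningful here.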


[Figure 1: Rademacher complexity bound (vertical axis) plotted against the manifold
regularization parameter (horizontal axis).]

Fig. 1. The behavior of the Rademacher complexity when using manifold regularization
on the circle dataset with different regularization values $\mu$.

Fig. 2. The resulting kernel when we use manifold regularization with parameter $\mu$ set
to 0.02 and 0.2.


8 Discussion and Conclusion 


This paper analysed improvements in terms of sample or Rademacher complexity
for a certain class of SSL methods. The performance of such methods depends both on
how the approximation error of the class $\mathcal{F}$ compares to that of $\mathcal{F}^{\psi}_{\tau}$ and on the
reduction of complexity obtained by switching from the former to the latter. In our
analysis we discussed the second part. The first part depends on a notion the literature
often refers to as a semi-supervised assumption. This assumption basically states
that we can learn with $\mathcal{F}^{\psi}_{\tau}$ as well as with $\mathcal{F}$. Without prior knowledge, it is
unclear whether one can test efficiently if the assumption is true or not. Or is
it possible to treat this just as a model selection problem? The only two works
we know of that provide some analysis in this direction are [3], which discusses
the sample consumption to test the so-called cluster assumption, and [2], which
analyzes the overhead of cross-validating the hyper-parameter coming from their
proposed semi-supervised approach.

As some of our settings need restrictions, it is natural to ask whether we can
extend the results. First, Lemma 1 restricts us to convex optimization problems.
If that assumption were unnecessary, one might get interesting extensions.
Neural networks, for example, are typically not convex in their function space
and we cannot guarantee the fast learning rate from Theorem 2. But maybe there
are semi-supervised methods that turn this space convex, and thus could achieve
fast rates. In Theorem 2 we have to restrict the loss to be the square loss, and
[1, Example 21.16] shows that for the absolute loss one cannot achieve such a
result. But whether Theorem 2 holds for the hinge loss, which is a typical choice
in classification, is unknown to us. We speculate that this is indeed true, as at
least the related classification tasks, which use the 0-1 loss, cannot achieve a rate
faster than $\frac{1}{\epsilon}$ [19, Theorem 6.8].

Corollary 2 sketches a scenario in which sample complexity improvements of
MR can be at most a constant over their supervised counterparts. This may sound
like a negative result, as other methods with similar assumptions can achieve
exponentially fast learning rates [16, Chapter 6]. But a constant improvement can still
have significant effects if this constant can be arbitrarily large. If we set the
regularization parameter $\mu$ in the concentric circles example high enough, the only
possible classification functions will be the ones that classify each circle uniformly
into one class. At the same time the pseudo-dimension of the supervised model can
be arbitrarily high, and thus also the constant in Corollary 2. In conclusion, one
should realize the significant influence constant factors can have in finite sample
settings.


Abstract. Designing, selling and/or exploiting connected vertical urban
farms is now receiving a lot of attention. In such farms, plants grow in 
controlled environments according to recipes that specify the different 
growth stages and instructions concerning many parameters (e.g., tem- 
perature, humidity, CO2, light). During the whole process, automated 
systems collect measures of such parameters and, at the end, we can 
get some global indicator about the used recipe, e.g., its yield. Looking 
for innovative ideas to optimize recipes, we investigate the use of a new 
optimal subgroup discovery method from purely numerical data. It con- 
cerns here the computation of subsets of recipes whose labels (e.g., the 
yield) show an interesting distribution according to a quality measure. 
When considering optimization, e.g., maximizing the yield, our virtuous 
circle optimization framework iteratively improves recipes by sampling 
the discovered optimal subgroup description subspace. We provide our
preliminary results about the added value of this framework thanks to a
plant growth simulator that enables inexpensive experiments.


Keywords: Subgroup discovery · Virtuous circle · Urban farms


1 Introduction 


Conventional farming methods have to face many challenges like, for instance,
soil erosion and/or an overuse of pesticides. The crucial problems related to
climate change also stimulate the design of new production systems. The concept
of urban farms (see, e.g., AeroFarms, FUL, Infarm¹) could be part of a solution.
It enables the growth of plants in fully controlled environments close to where
consumers are [8]. Most crop protection chemicals can be removed while still
optimizing both the quantity and the quality of plants
(e.g., improving the flavor [9] or their chemical proportions [20]).


1 https://aerofarms.com/, http://www.fermeful.com/, https://infarm.com/.

Urban farms can generate large amounts of data that can be pushed towards a
cloud environment such that various machine learning and data mining methods 
can be used. We may then provide new insights about the plant growth process 
itself (discovering knowledge about not yet identified/understood phenomena) 
but also offer new services to farm owners. We focus here on services that rely 
on the optimization of a given target variable, e.g., the yield. The number of 
parameters influencing plant growth can be relatively large (e.g., temperature, 
hygrometry, water pH level, nutrient concentration, LED lighting intensity, CO2 
concentration). There are numerous ways of measuring the crop end-product 
(e.g., energy cost, plant mass and size, flavor and chemical properties). In gen- 
eral, for a given type of plants, expert knowledge exists that concerns the avail- 
able sub-systems (e.g., to model the impact of nutrient on growth, the effect 
of LED lighting on photosynthesis, the energy consumption w.r.t. the tempera- 
ture instruction) but we are far from a global understanding of the interaction 
between the various underlying phenomena. In other words, setting the optimal
instructions for the diverse set of parameters given an optimization task remains 
an open problem. 

We want to address such an issue by means of data mining techniques. Plant
growth recipes are made of instructions in time and space for many numerical
attributes. Once a recipe is completed, collections of measures have been
gathered and we assume that at least one numerical target label value is available,
e.g., the yield. Can we learn from available recipe records to suggest new ones
that should provide better results w.r.t. the selected target attribute? For that
purpose, we investigate the use of subgroup discovery [12,21]. It aims at
discovering subsets of objects, called subgroups, with high quality according to a
quality measure calculated on the target label. Such a quality measure has to
capture deviations between the target label distribution in the overall
dataset and in the considered subset of objects. When addressing only subgroup
discovery from numerical data, a few approaches for numerical attributes [6, 15]
and numerical target labels [14] have been described. To the best of our
knowledge, the reference algorithm for subgroup discovery in purely numerical data
is SD-Map* [14]. However, like other methods, it relies on discretization, which
leads to loss of information and sub-optimal results.

Our first contribution concerns the proposal of a simple branch and bound
algorithm called MinIntChange4SD that exploits the exhaustive enumeration
strategy from [11] to achieve a guaranteed optimal subgroup discovery in
numerical data without any discretization. Discussing details about this algorithm is
out of the scope of this paper; we recently designed a significantly optimized
version of MinIntChange4SD in [17]. Our main contribution concerns a new
methodology for plant growth recipe optimization that (i) uses MinIntChange4SD to
find the optimal subgroup of recipes and (ii) exploits the subgroup description
to design better recipes which can in turn be analyzed with subgroup discovery,
and so on.

The paper is organized as follows. Section 2 formalizes the problem. In Sect. 3,
we discuss related works and their limitations. In Sect. 4, we introduce our new
optimal subgroup discovery algorithm and we detail our framework for plant
growth recipe optimization. An empirical evaluation of our method is in Sect. 5.
Section 6 briefly concludes.


Fig. 1. (left) Purely numerical dataset. (center) Non-closed ($p_1 = ([2, 4], [1, 3])$,
non-hatched) and closed ($p_2 = ([2, 4], [2, 3])$, hatched) interval patterns. (right)
Depth-first traversal of $m_2$ using minimal changes.


2 Problem Definition 


Numerical Dataset. A numerical dataset (G, M, T) is given by a set of objects
G, a set of numerical attributes M and a numerical target label T. In a given
dataset, the domain of any attribute $m \in M$ (resp. label T) is a finite ordered
set denoted $D_m$ (resp. $D_T$). Figure 1 (left) provides a numerical dataset made of
two attributes $M = \{m_1, m_2\}$ and a target label T. A subgroup p is defined by
a pattern, i.e., its intent or description, and the set of objects from the dataset
where it appears, i.e., its extent, denoted ext(p). For instance, in Fig. 1, the
domain of $m_1$ is {1, 2, 3, 4} and the intent ([2, 4], [1, 3]) (see the definition of
interval patterns later) denotes a subgroup whose extent is $\{g_3, g_4, g_5, g_6\}$.


Quality Measure, Optimal Subgroup. The interestingness of a subgroup in a
numerical dataset is measured by a numerical value. We consider here the quality
measure based on the mean introduced in [14]. Let p be a subgroup. The quality
of p is given by $q^a_{mean}(p) = |ext(p)|^a \times (\mu_{ext(p)} - \mu_{ext(\emptyset)})$, $a \in [0, 1]$. Here $|ext(p)|$
denotes the cardinality of ext(p), $\mu_{ext(p)}$ is the mean of the target label in the
extent of p, $\mu_{ext(\emptyset)}$ is the mean of the target label in the overall dataset, and a is
a parameter that controls the number of objects of the subgroups. Let (G, M, T)
be a numerical dataset, q a quality measure and P the set of all subgroups of
(G, M, T). A subgroup $p \in P$ is said to be optimal iff $\forall p' \in P : q(p') \le q(p)$.
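
To make the role of the parameter a concrete, the mean-based quality measure can be sketched as follows. The function name `quality_mean` and the six yield values are hypothetical and only serve as an illustration; the formula is the one stated above.

```python
from statistics import mean

def quality_mean(targets, extent, a):
    """q^a_mean(p) = |ext(p)|^a * (mean over ext(p) - mean over the whole dataset)."""
    if not extent:
        raise ValueError("the extent of a subgroup must not be empty")
    overall = mean(targets)
    inside = mean(targets[i] for i in extent)
    return len(extent) ** a * (inside - overall)

# Hypothetical yields of six recipes g1..g6 (stored at indices 0..5).
yields = [2.0, 4.0, 9.0, 8.0, 7.0, 6.0]
small = [2]            # only the single best recipe
large = [2, 3, 4, 5]   # four good recipes

for a in (0.0, 0.5, 1.0):
    print(a, quality_mean(yields, small, a), quality_mean(yields, large, a))
```

With a = 0 only the mean matters, so the single best recipe wins; with a = 1 the size of the extent dominates and the larger subgroup is preferred.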


Plant Growth Recipe and Optimization Measure. A plant growth recipe
(M, P, T) is given by a set of numerical parameters M specifying the growing
conditions thanks to intervals on numerical values, a numerical value P
representing the number of stages of the growth cycle, and a numerical target label T
to quantify the recipe quality. In a given recipe, each parameter of M is repeated
P times s.t. we have $|M| \times P$ numerical attributes. Our goal is to optimize recipes
and we want to discover actionable patterns in the sense that delivering such
patterns will support the design of new growing conditions. An optimization
measure f quantifies the quality of an iteration. We are interested in the mean
of the target label of the objects of the optimal subgroup after each iteration.
The measure is given by $f_{mean} = \frac{\sum_{i \in ext(p)} T(i)}{|ext(p)|}$, where $p$ is the optimal
subgroup of the iteration and $T(i)$ is the value of the target label for object $i$.


3 Related Work 


Designing recipes that optimize a given target attribute (e.g., the mass, the
energy cost) is often tackled by domain experts who exploit the scientific
literature. However, in our setting, this approach has two major drawbacks. First,
most of the literature remains oriented towards conventional growing conditions and
farming methods. In urban farms, there are more parameters that can be controlled.
Secondly, the amount of knowledge about plants is unbalanced from one plant to
another. Therefore, relying only on expert knowledge for plant recipe
optimization is not sufficient. We have an optimization problem and the need for a
limited number of iterations. Indeed, experimenting with plant growth recipes is
time-consuming (i.e., it takes weeks or months). Therefore, we have to minimize
the number of experiments that are needed to optimize a given recipe. There are
two main families of methods addressing the problem of optimizing a function 
over numerical variables: direct and model-based [18]. For direct methods, the 
common idea is to apply various strategies to sequentially evaluate solutions in 
the search space of recipes. However such methods do not address the problem 
of minimizing the number of experiments. For model-based methods, the idea 
is to build a model simulating the ground truth using available data and then 
to use it to guide the search process. For instance, [9] introduced a solution for 
recipe optimization using this type of method with the goal of optimizing the 
flavor of plants. Their framework is based on using a surrogate model, in this 
case a Symbolic Regression [13]. It considers recipe optimization by means of a 
promising virtuous circle. However, it suffers from several shortcomings: there 
is no guarantee on the quality of the generated models (i.e., they may not be 
able to model correctly the ground truth), the number of tested parameters is 
small (only 3), and the ratio between the number of objects and the number of 
parameters in the data needs to be at least ten for Symbolic Regression [10]. 
Clearly, it would restrict the search to only a few parameters. 

Heuristic [2,15] and exhaustive [1,5] solutions have been proposed for
subgroup discovery. Usually, these approaches consider a set of nominal attributes
with a binary label. To work with numerical data, prior discretization of the
attributes is then required (see, e.g., [3]), which leads to loss of information and
suboptimal results. A major issue with exhaustive pattern mining is the size
of the search space. Fortunately, optimistic estimates can be used to prune the
search space and provide tractability in practice [7,21]. [14] introduces a large
panel of quality measures and corresponding optimistic estimates for exhaustive
subgroup mining given numerical target labels. They describe SD-Map*,
the reference algorithm for subgroup discovery in numerical data. Notice
however that for [14] or others [6,15], discretization techniques over the numerical
attributes have to be performed. For an exhaustive search of frequent patterns
(not subgroups) in numerical data without discretization, we
find the MinIntChange algorithm [11]. Using closure operators (see, e.g., [4]) has
become a popular solution to reduce the size of the search space. We
exploit most of these ideas to design our optimal subgroup discovery algorithm.


4 Optimization with Subgroup Discovery 


4.1 An Efficient Algorithm for Optimal Subgroup Discovery 


Let us first introduce MinIntChange4SD, our branch and bound algorithm for
optimal subgroup discovery in purely numerical data. It exploits concepts
about interval patterns from [11].


Interval Patterns, Extent and Closure. In a numerical dataset (G, M, T),
an interval pattern p is a vector of intervals $p = ([a_i, b_i])_{i \in \{1, \dots, |M|\}}$ with $a_i, b_i \in D_{m_i}$,
where each interval is a restriction on an attribute of M, and |M| is the
number of attributes. Let $g \in G$ be an object. g is in the extent of an interval
pattern $p = ([a_i, b_i])_{i \in \{1, \dots, |M|\}}$ iff $\forall i \in \{1, \dots, |M|\}$, $m_i(g) \in [a_i, b_i]$. Let $p_1$ and
$p_2$ be two interval patterns. $p_1 \subseteq p_2$ means that $p_2$ encloses $p_1$, i.e., the
hyper-rectangle of $p_1$ is included in that of $p_2$. It is said that $p_1$ is a specialization of
$p_2$. Let p be an interval pattern and ext(p) its extent. p is defined as closed if
and only if it is the most restrictive pattern (i.e., the smallest hyper-rectangle)
that contains ext(p). Figure 1 (center) depicts the dataset of Fig. 1 (left) in a
Cartesian plane as well as examples of interval patterns that are closed ($p_2$) or
not ($p_1$).
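
The three notions above translate directly into code. The sketch below uses a hypothetical six-object dataset (the actual values of Fig. 1 are not reproduced in the text), chosen so that the extent and the closure of ([2, 4], [1, 3]) agree with the examples given in this paper; the function names are ours.

```python
def extent(data, pattern):
    """Indices of the objects whose value lies in [a_i, b_i] for every attribute i."""
    return [g for g, row in enumerate(data)
            if all(a <= v <= b for v, (a, b) in zip(row, pattern))]

def closure(data, ext):
    """Smallest hyper-rectangle (closed interval pattern) containing the objects in ext."""
    columns = list(zip(*(data[g] for g in ext)))
    return tuple((min(col), max(col)) for col in columns)

def is_specialization(p1, p2):
    """True if the hyper-rectangle of p1 is included in that of p2."""
    return all(a2 <= a1 and b1 <= b2 for (a1, b1), (a2, b2) in zip(p1, p2))

# Hypothetical dataset: six objects g1..g6 (indices 0..5) with attributes (m1, m2).
data = [(1, 1), (1, 4), (2, 2), (3, 3), (4, 2), (4, 3)]
p1 = ((2, 4), (1, 3))        # not closed: no covered object has m2 = 1
ext_p1 = extent(data, p1)    # objects g3..g6
p2 = closure(data, ext_p1)   # closed pattern with the same extent
print(ext_p1, p2, is_specialization(p2, p1))
```

A pattern and its closure always share the same extent, which is why evaluating only closed patterns loses no subgroup.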


Traversing the Search Space with Minimal Changes. To guarantee the
optimal subgroup discovery, we rely on the so-called minimal changes
introduced in MinIntChange. They enable an exhaustive enumeration within the interval
pattern search space. A left minimal change consists in replacing the left bound
of an interval by the closest higher value in the domain of the
corresponding attribute. Similarly, a right minimal change consists in replacing
the right bound by the closest lower value. The search starts with
the computation of the minimal interval pattern that covers all the objects of
the dataset. The principle is to apply consecutive right or left minimal changes
until, for each interval of the pattern, the left and right bounds have the same
value. In that case, the algorithm backtracks until it finds a pattern on which a
minimal change can be applied. Figure 1
(right) depicts the depth-first traversal of attribute $m_2$ from the dataset of Fig. 1
(left) using minimal changes.


Compressing and Pruning the Search Space. We leverage the concept of
closure to significantly reduce the number of candidate interval patterns. After
a minimal change, instead of evaluating the resulting interval pattern, we
compute its corresponding closed interval pattern. We exploit advanced
pruning techniques to reduce the size of the search space thanks to the use of a
tight optimistic estimate. We also exploit a combination of forward checking
and branch reordering. Given an interval pattern, the set of all its direct
specializations (application of a right or left minimal change on each interval) is
computed (forward checking) and those whose optimistic estimate is higher than
the best subgroup quality are stored. Branch reordering by descending order of
the optimistic estimate value is then carried out, which makes it possible to explore
the most promising parts of the search space first. It also enables a more efficient
pruning by raising the minimal quality early. Providing details about the
algorithm is out of the scope of this paper, though its source code is available at
https://bit.ly/3bA87NE. The important outcome is that it guarantees the
discovery of optimal subgroups for a given quality measure. Provided that it
remains tractable, runtime efficiency is not an issue here, given that we want
to use the algorithm at some steps of quite slow vegetable growth processes.
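
The ingredients described above (minimal changes, closure, optimistic estimate, forward checking and branch reordering) can be combined into a small branch and bound. This is a simplified sketch and not the MinIntChange4SD implementation: duplicates are avoided with a set of visited closed patterns rather than with the canonical ordering of [11], and the optimistic estimate is a generic bound for mean-based measures (best prefix of the target values sorted in descending order), which may differ from the one used by the authors. The data are hypothetical.

```python
from statistics import mean

def extent(data, pattern):
    return [g for g, row in enumerate(data)
            if all(a <= v <= b for v, (a, b) in zip(row, pattern))]

def closure(data, ext):
    return tuple((min(col), max(col)) for col in zip(*(data[g] for g in ext)))

def quality(targets, ext, a):
    return len(ext) ** a * (mean(targets[g] for g in ext) - mean(targets))

def optimistic_estimate(targets, ext, a):
    """Upper bound on the quality of any non-empty subset of ext."""
    overall, best, total = mean(targets), float("-inf"), 0.0
    ranked = sorted((targets[g] for g in ext), reverse=True)
    for k, t in enumerate(ranked, start=1):
        total += t
        best = max(best, k ** a * (total / k - overall))
    return best

def minimal_changes(domains, pattern):
    """Direct specializations: raise one left bound or lower one right bound by one step."""
    for i, (lo, hi) in enumerate(pattern):
        if lo == hi:
            continue
        dom = domains[i]
        next_lo = dom[dom.index(lo) + 1]
        prev_hi = dom[dom.index(hi) - 1]
        yield pattern[:i] + ((next_lo, hi),) + pattern[i + 1:]
        yield pattern[:i] + ((lo, prev_hi),) + pattern[i + 1:]

def best_subgroup(data, targets, a=0.5):
    domains = [sorted(set(col)) for col in zip(*data)]
    everything = range(len(data))
    root = closure(data, everything)
    best_q, best_p = quality(targets, everything, a), root
    stack, seen = [root], {root}
    while stack:
        pattern = stack.pop()
        candidates = []
        for child in minimal_changes(domains, pattern):   # forward checking
            ext = extent(data, child)
            if not ext:
                continue
            closed = closure(data, ext)                   # compress with the closure
            if closed in seen:
                continue
            seen.add(closed)
            q = quality(targets, ext, a)
            if q > best_q:
                best_q, best_p = q, closed
            estimate = optimistic_estimate(targets, ext, a)
            if estimate > best_q:                         # otherwise prune the branch
                candidates.append((estimate, closed))
        candidates.sort()                                 # most promising is popped first
        stack.extend(p for _, p in candidates)
    return best_p, best_q

# Hypothetical recipes (two attributes) and yields.
data = [(1, 1), (1, 4), (2, 2), (3, 3), (4, 2), (4, 3)]
yields = [2.0, 4.0, 9.0, 8.0, 7.0, 6.0]
print(best_subgroup(data, yields, a=0.5))
```

Pruning is safe because every specialization of a pattern has an extent included in that of the pattern, so its quality cannot exceed the optimistic estimate.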


4.2 Leveraging Subgroups to Optimize Recipes 


A Virtuous Circle. Our optimization framework can be seen as a virtuous
circle, where each new iteration uses information previously gathered to
iteratively improve the targeted process. First, a set of recipe experiments is
created, with or without the use of expert knowledge. With
the use of expert knowledge, values or domains of values are defined for each
attribute and then recipes are produced using these values. When generating
recipes without prior knowledge, we create recipes by randomly sampling the
values of each attribute. Secondly, we use subgroup discovery to find the best
subgroup of recipes according to the chosen quality measure (e.g., the subgroup
of recipes with the best average yield). Then, we exploit the subgroup
description, i.e., we apply new restrictions on the range of each parameter according
to the description, to generate new, better recipe experiments. Finally, these
recipes are in turn processed to find the best subgroup for the new recipes, and
so on until recipes cannot be improved anymore. This way, we sample recipes in
a space which gets smaller after each iteration and where the ratio between good
and bad solutions gets larger and larger. Figure 2 depicts a step-by-step
example of the process behind the framework. Our framework makes use of several
hyperparameters that affect runtime efficiency, the number of iterations and the
quality of the results.


[Figure 2: three panels (Iteration 1, Iteration 2, Iteration 3) over the attributes
$m_1$ and $m_2$.]

Fig. 2. Example of execution of the optimization framework in 3 iterations. We consider
a two-dimensional space (i.e., 2 attributes $m_1$ and $m_2$) where 4 recipes are generated
during each iteration using our first sampling method. The best subgroup (optimizing
the yield) of each iteration (hatched) serves as the next iteration sampling space.


Convergence. The first hyperparameter is the parameter a used in the $q^a_{mean}$
quality measure. In standard subgroup discovery, it controls the number of
objects in the returned subgroups. A higher value of a means larger subgroups.
For us, a larger subgroup means a larger search space to sample. By extension, a
higher value of a means more iterations to be able to reach smaller subspaces of
the search space. For that reason, we rename the parameter as the convergence
rate. The second hyperparameter is called the minimal improvement (minImp).
It defines the minimal improvement of the optimization measure ($f_{mean}$ in our
setting) needed from one iteration to another for the framework to keep running.
After each iteration, we check whether the following statement is true or false:

$$\frac{f_{mean_{it}} - f_{mean_{it-1}}}{f_{mean_{it-1}}} \ge minImp$$

If it is true, then the optimization framework keeps running, else we consider
that the recipes cannot be improved any further. This parameter has a direct
effect on the number of iterations needed for the algorithm to converge. A higher
value for minImp means a lower number of iterations and vice versa. We can also
drop minImp and set the number of iterations by means of another parameter
that would denote a budget.
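
The stopping rule can be sketched as follows. The helper names and the trace of $f_{mean}$ values are hypothetical, and the sketch assumes a positive $f_{mean}$ so that the relative improvement is well defined.

```python
def keeps_running(f_prev, f_curr, min_imp):
    """True if the relative improvement of f_mean reaches min_imp."""
    if f_prev <= 0:
        raise ValueError("relative improvement assumes a positive previous value")
    return (f_curr - f_prev) / f_prev >= min_imp

def iterations_until_stop(f_values, min_imp):
    """Number of iterations executed for a given trace of f_mean values."""
    done = 1
    for prev, curr in zip(f_values, f_values[1:]):
        done += 1
        if not keeps_running(prev, curr, min_imp):
            break
    return done

# Hypothetical f_mean values observed after iterations 1 to 4.
trace = [10.0, 12.0, 12.5, 12.6]
for min_imp in (0.3, 0.05, 0.01):
    print(min_imp, iterations_until_stop(trace, min_imp))
```

On this trace a larger minImp stops the loop earlier, which matches the behavior described above.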


Sampling the Subspace. After each iteration, to generate new recipes to 
experiment with, we need to sample the subspace corresponding to the descrip- 
tion of the best subgroup. Three sampling methods are currently available and 
this defines again a new hyperparameter. The first method consists in sampling 
recipes using the original set of values of each attribute (i.e., in the first iter- 
ation) minus the excluded values due to the new restrictions applied on the 
subspace. Let D}, be the domain of values of attribute m at Iteration 1 and 
aż, bn] be the interval of attribute m at Iteration i according to the description 
of the best subgroup of Iteration i— 1. Then, Vv € D}, v € Dt, & bi, > v > af,- 
Using this method, the number of values available for sampling for each attribute 
gets smaller after each iteration, meaning that each iteration is faster than the 
previous one. The second consists in discretizing the search space through the 
discretization of each attribute in k intervals of equal length. Parameter k is 
set before launching the framework. Recipes are then sampled using the
discretized domain of values for each attribute. Finally, we can use Latin Hypercube
Sampling [16] as a third method. In Latin Hypercube Sampling, each attribute
is divided into S equally probable intervals, with S the number of samples (i.e.,
recipes). Using this method, recipes are sampled such that each recipe is the
only one in each axis-aligned hyperplane that contains it. The number of samples generated
for each iteration is also a hyperparameter of the framework. 
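
The three sampling methods can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the second method represents each of the k intervals by its midpoint (the text does not say which representative value is used), and the third method assumes uniformly distributed attributes, so that equally probable intervals are intervals of equal length.

```python
import random


def restrict_domain(initial_values, low, high):
    """Method 1: keep the first-iteration values that fall in [low, high]."""
    return [v for v in initial_values if low <= v <= high]


def equal_width_values(low, high, k):
    """Method 2: one representative (the midpoint, by assumption) per interval."""
    width = (high - low) / k
    return [low + (i + 0.5) * width for i in range(k)]


def latin_hypercube(bounds, n_samples, rng):
    """Method 3: Latin Hypercube Sampling over a box given as (low, high) pairs."""
    columns = []
    for low, high in bounds:
        step = (high - low) / n_samples
        # Draw one point inside each of the n_samples strata of this attribute.
        points = [low + (i + rng.random()) * step for i in range(n_samples)]
        # Shuffle so that strata are paired at random across attributes.
        rng.shuffle(points)
        columns.append(points)
    return [tuple(col[s] for col in columns) for s in range(n_samples)]


bounds = [(0, 40), (0, 25000), (0, 30)]  # rain, irradiation, wind for one stage
recipes = latin_hypercube(bounds, n_samples=20, rng=random.Random(0))
print(recipes[0])
```

Each attribute contributes exactly one value per stratum, so no two sampled recipes share a stratum on any attribute.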


An Explainable Generic Framework. Our optimization framework is
explainable, in contrast to black-box optimization algorithms. Each step of the
process is easily understandable due to the descriptive nature of subgroup discovery.
Although we have been referring to our algorithm MinIntChange4SD when intro- 
ducing the optimization framework, other subgroup discovery algorithms can be 
used, including [14] and [17]. Notice however that the better the quality of the 
provided subgroup, the better the results returned by our framework will be. 
Finally, our method can be applied to many application domains where
we want to optimize a numerical target given collections of numerical features 
(e.g., hyperparameter optimization in machine learning). 


5 Experiments 


We work on urban farm recipe optimization although we do not yet have access to real
farming data. One of our partners in the FUI DUF 4.0 project (2018-2021)
is designing new types of urban farms. We found a way to support the empiri- 
cal study of our recipe optimizing framework thanks to inexpensive experiments 
enabled by a simulator. In an urban farm, plants grow in a controlled envi- 
ronment. In the absence of failure, recipe instructions are followed and we can 
investigate the optimization of the plant yield at the end of the growth cycle. 
We simulate recipe experiments with the PCSE² simulation environment by
setting the characteristics (e.g., the climate) of the different growth stages. We 
focus on 3 variables that set the amount of solar irradiation (range [0, 25000]), 
wind (range [0, 30]) and rain (range [0, 40]). The plant growth is split into 3 stages 
of equal length such that we finally get 9 attributes. In real life, we can control 
most of the parameters of an urban farm (e.g., providing more or less light) 
and a recipe optimization iteration needs new insights about the promising
parameter values. This is what we can emulate using the crop simulator: given 
the description of the optimal subgroup, we get insights to support the design 
of the next simulations, say experiments, as if we were controlling the growth 
environment. At the end of the growth cycle, we retrieve the total mass of plants 
harvested using a given recipe. Note that in the following experiments, unless 
stated otherwise, no assumption is made on the values of parameters (i.e., no 
restriction is applied on the range of values defined above and expert knowledge 
is not taken into account). Table 1 features examples of plant growth recipes.
The source code and datasets used in our evaluation are available at
https://bit.ly/3bA87NE.
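
To make the recipe representation concrete, the following sketch draws a random recipe over the 9 attributes (3 variables for each of 3 growth stages) within the ranges given above. The attribute naming scheme and the use of a uniform distribution are assumptions made for illustration; the call to the crop simulator that returns the yield is deliberately left out.

```python
import random

# Ranges stated in the text for the three controlled variables.
RANGES = {"Irrad": (0, 25000), "Wind": (0, 30), "Rain": (0, 40)}
STAGES = ("P1", "P2", "P3")  # three growth stages of equal length


def random_recipe(rng):
    """Draw one recipe: a value for each variable in each growth stage."""
    return {
        f"{variable}_{stage}": rng.uniform(low, high)
        for stage in STAGES
        for variable, (low, high) in RANGES.items()
    }


recipe = random_recipe(random.Random(42))
print(len(recipe), sorted(recipe))
```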


2 https://pcse.readthedocs.io/en/stable/index.html.

Table 1. Examples of growth recipes split in 3 stages (P1, P2, P3), 3 attributes, and
a target label (Yield).


R  | Rain^P1 | Irrad^P1 | Wind^P1 | Rain^P2 | Irrad^P2 | Wind^P2 | Rain^P3 | Irrad^P3 | Wind^P3 | Yield
r1 | 10      | 23250    | 5       | 10      | 23250    | 5       | 15      | 21000    | 10      | 22000
r2 | 35      | 10000    | 14      | 5       | 25000    | 10      | 16      | 19500    | 30      | 20500
r3 | 15      | 17500    | 26      | 22      | 15000    | 18      | 30      | 4000     | 3       | 8600
r4 | 18      | 22800    | 17      | 38      | 17000    | 25      | 38      | 12000    | 19      | 14200


Table 2. Comparison between descriptions of the overall dataset (DS), the optimal
subgroup returned by MinIntChange4SD (MIC4SD), and the optimal subgroup returned
by SD-Map*. "-" means no restriction on the attribute compared to DS; Q and S are
respectively the quality and size of the subgroup.


Subgroup | Rain^P1  | Irrad^P1      | Wind^P1 | Rain^P2 | Irrad^P2       | Wind^P2 | Rain^P3  | Irrad^P3       | Wind^P3 | Q     | S
DS       | [0, 39]  | [1170, 23471] | [2, 29] | [0, 37] | [111, 24111]   | [0, 29] | [2, 40]  | [964, 24197]   | [1, 30] | 0     | 30
MIC4SD   | [16, 37] | [1170, 22085] | [2, 24] | [7, 37] | [18309, 23584] | [2, 24] | [15, 37] | [12626, 24197] | [1, 25] | 33874 | 7
SD-Map*  | [21, 39] | -             | -       | -       | [14455, 24111] | -       | -        | [12760, 24197] | -       | 30662 | 15


5.1 MinIntChange4SD vs SD-Map* 


We study the description of the best subgroup returned by MinIntChange4SD
and SD-Map*, the state-of-the-art algorithm for subgroup discovery in numerical
data. Table 2 depicts the descriptions for a dataset comprised of 30 recipes
generated randomly with the simulator. Besides the higher quality of the
subgroup returned by MinIntChange4SD, the optimal subgroup description also
makes it possible to extract information that is missing from the description obtained
with SD-Map*. In fact, where SD-Map* only offers a strong restriction on 3
attributes, MinIntChange4SD provides actionable information on all 9
considered attributes. This confirms its qualitative superiority
over SD-Map*, which has to discretize the attributes.


5.2 Empirical Evaluation of the Model Hyperparameters 


Our optimization framework involves several hyperparameters whose values need 
to be studied to define proper ranges or values that will lead to optimized results 
with a minimized number of recipe experiments. We choose to apply a random 
search on discretized hyperparameters. Note that in this setting, grid search
is impractical due to the combinatorial number of hyperparameter values
and the high time cost of the optimization process itself. We discretize each
hyperparameter into several values (the convergence rate is split into 10 values
ranging from 0.1 to 1, the minimal improvement parameter is split into 12 values
between 0 and 0.05, the sampling parameter is split between the 3 available
methods, and the number of recipes for each iteration is either 20 or 30). We
run 100 iterations of random search, with each iteration (i.e., each set of parameter
values) being tested 10 times and averaged to account for the randomness of the
recipes generated. After each iteration of random search, we store the set of
hyperparameter values and the corresponding best recipe found. Figure 3 depicts
the results of the experiments. Optimal values seem to be around 0.5 for the
convergence rate and between 0.001 and 0.01 for the minimal improvement, and the
best sampling method is tied between the first and second one. Generating 30 recipes for each
iteration yields better results than 20 (average yield of 23857 for 30 recipes
against 22829 for 20 recipes). To compare our method against other methods, we
run our framework with the following parameters: 30 recipes times 5 iterations
(for a total of 150 recipes), 0.5 convergence rate, using the second sampling
method with k = 15. To address the variance in the yield due to randomness
in the recipe generation process, we run the framework 10 times, we store the
best recipe found at each iteration and then compute the average yield of the stored
recipes. We report the results in Table 3.


Fig. 3. Yield of the best recipe depending on the value of different hyperparameters
using 100 sample recipes for each hyperparameter.
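
The random search over discretized hyperparameters described in this section can be sketched as follows. This is an illustration only: the `evaluate` callable stands in for a full run of the optimization framework and is hypothetical, the method labels are ours, and the even spacing of the 12 minimal improvement values is an assumption because the text only gives their range.

```python
import random

CONVERGENCE_RATES = [round(0.1 * i, 1) for i in range(1, 11)]  # 10 values, 0.1 to 1
MIN_IMPROVEMENTS = [0.05 * i / 11 for i in range(12)]  # 12 values in [0, 0.05] (assumed even)
SAMPLING_METHODS = ["restricted_domain", "equal_width", "latin_hypercube"]
RECIPES_PER_ITERATION = [20, 30]

# Size of the full grid that random search avoids enumerating.
GRID_SIZE = (len(CONVERGENCE_RATES) * len(MIN_IMPROVEMENTS)
             * len(SAMPLING_METHODS) * len(RECIPES_PER_ITERATION))


def random_search(evaluate, n_configs=100, n_repeats=10, seed=0):
    """Sample configurations at random and average the best yield over repeats."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_configs):
        config = {
            "convergence_rate": rng.choice(CONVERGENCE_RATES),
            "min_imp": rng.choice(MIN_IMPROVEMENTS),
            "sampling": rng.choice(SAMPLING_METHODS),
            "recipes_per_iteration": rng.choice(RECIPES_PER_ITERATION),
        }
        yields = [evaluate(config, rng) for _ in range(n_repeats)]
        results.append((config, sum(yields) / n_repeats))
    return results


# Toy stand-in for the framework: the returned "yield" is purely illustrative.
def toy_evaluate(config, rng):
    return 1000.0 * config["recipes_per_iteration"]


results = random_search(toy_evaluate)
print(GRID_SIZE, len(results))
```

Under this discretization the full grid contains 720 configurations, each of which would need 10 repeated runs of the framework, whereas the random search evaluates only 100 of them.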


5.3 Comparison with Alternative Methods 


Good hyperparameter values have been defined for our optimization framework 
and we can now compare our method with other ones. Let us consider the use 
of expert knowledge and random search. First, we want to create a model using 
expert knowledge. With the help of an agricultural engineer, we defined a priori 
good values for each parameter using expert knowledge and we generated a recipe 
that can serve as a baseline for our experiments. We then choose to compare our 
method against a random search model without expert knowledge. We set the
number of recipes to 150 for all methods, matching the 150 recipes used by our
own model, to provide a fair comparison. To account for randomness
in the recipe generation, we run 10 iterations of the random search model, we 
store the value of the best recipe found in each iteration, and we compute their 
average yield. Results of the experiments and a description of the best recipe for 
each method are available in Table 3. Random search and expert knowledge find
recipes with almost equal yields, while our framework finds recipes with a higher
yield. Note that in industrial settings, an improved yield of 3% to 4% has a
significant impact on revenues.


Table 3. Comparison of the description and the yield of the best recipe returned
by each method. EK = Expert Knowledge, RS = Random Search, SM = Surrogate
Modeling, VC = Virtuous Circle (our framework).


Method | Rain^P1 | Irrad^P1 | Wind^P1 | Rain^P2 | Irrad^P2 | Wind^P2 | Rain^P3 | Irrad^P3 | Wind^P3 | Yield
EK     | 10      | 0        | 5       | 10      | 25000    | 5       | 10      | 25000    | 5       | 23472
RS     | 17      | 23447    | 8       | 31      | 22222    | 23      | 39      | 22385    | 7       | 23561
SM     | 20      | 44       | 0       | 20      | 24981    | 0       | 40      | 31       | 30      | 10170
VC     | 19      | 16121    | 18      | 25      | 24052    | 28      | 14      | 21126    | 7       | 24336
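
As a quick check of the order of magnitude, the relative gain of our framework over the two baselines can be computed directly from the yields reported in Table 3. The helper name below is ours.

```python
# Yields of the best recipes as reported in Table 3.
YIELDS = {"EK": 23472, "RS": 23561, "VC": 24336}


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


gain_over_ek = relative_gain(YIELDS["VC"], YIELDS["EK"])
gain_over_rs = relative_gain(YIELDS["VC"], YIELDS["RS"])
print(f"VC vs EK: {gain_over_ek:.2f}%")  # about 3.68%
print(f"VC vs RS: {gain_over_rs:.2f}%")  # about 3.29%
```

Both gains fall in the 3% to 4% band mentioned in the text.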

Let us now compare our framework to the Surrogate Modeling method pre- 
sented in [9]. To be fair, we give the same number of data points to build the 
Symbolic Regression surrogate model as we used in previous experiments, i.e., 
150 for training the model (we evaluated the RMSE of the model on a test set 
of 38 other samples). We use gplearn [19] with default parameters, except for
the number of generations and the number of models evaluated for each
generation, which are set to 1000 and 2000, respectively, as in [9]. Note that the model
obtained has an RMSE of 2112 and is composed of more than 2000 terms
(including mathematical operators); the argument of interpretability
is therefore questionable. A grid search is finally performed on this model: we select the
best recipe and obtain its true yield using the PCSE simulation environment.
The number of steps for each attribute for the grid search has to be defined. 
We set it to 5. As we have 9 parameters, it means that the model needs to be 
evaluated on nearly 9 million potential recipes. Also, the model is composed of 
hundreds of terms such that experiments are computationally expensive. The 
best recipe found so far is given in Table 3. The surrogate model predicts a yield 
value of 21137. Compared to the ground truth of 10170, the model has a strong 
bias. It illustrates that using a surrogate model for this kind of problem will 
give good recipes only if it is reliable enough. Interestingly, the RMSE seems to 
be quite good at first glance, but this does not guarantee that the model will 
behave correctly on all elements of the search space: on the best recipe found, 
it largely overestimates the yield, leading to an uninteresting recipe. It seems
that this method performs poorly on recipes with more attributes than in [9].
Further studies are needed here.


6 Conclusion 


We investigated the optimization of plant growth recipes in controlled
environments, a key process in connected urban farms. We explained why
existing methods fall short of real-life constraints, including the necessity to
minimize the number of experiments needed to provide good results. We detailed a
new optimization framework that leverages subgroup discovery to iteratively find
better growth recipes through the use of a virtuous circle. We also introduced
an efficient algorithm for optimal subgroup discovery in purely numerical
datasets, which has recently been improved much further in [17]. The algorithm
avoids discretization, which provides a qualitative added value (i.e., more
interesting optimal subgroups). Future work includes extending our framework to deal with
multiple target labels at the same time (e.g., optimizing the yield while keeping
the energy cost as low as possible).


Acknowledgment. Our research is partially funded by the French FUI programme 
(project DUF 4.0, 2018-2021).


Abstract. The evaluation of machine learning (ML) pipelines is essential
during automatic ML pipeline composition and optimisation.
Previous methods, such as Bayesian-based and genetic-based optimisation,
which are implemented in Auto-Weka, Auto-sklearn and TPOT,
evaluate pipelines by executing them. Therefore, the pipeline composition
and optimisation of these methods requires a tremendous amount
of time, which prevents them from exploring complex pipelines to find better
predictive models. To further explore this research challenge, we have
conducted experiments showing that many of the generated pipelines are
invalid, and it is unnecessary to execute them to find out whether they
are good pipelines. To address this issue, we propose a novel method to
evaluate the validity of ML pipelines using a surrogate model (AVATAR).
The AVATAR accelerates automatic ML pipeline composition
and optimisation by quickly ignoring invalid pipelines. Our experiments
show that the AVATAR is more efficient in evaluating complex pipelines
in comparison with the traditional evaluation approaches requiring their
execution.


1 Introduction 


Automatic machine learning (AutoML) has been studied to automate the pro- 
cess of data analytics to collect and integrate data, compose and optimise ML 
pipelines, and deploy and maintain predictive models [1-3]. Although many 
existing studies proposed methods to tackle the problem of pipeline composi- 
tion and optimisation [2,4-9], these methods have two main drawbacks. Firstly, 
the pipelines’ structures, which define the execution order of the pipeline
components, use fixed templates [2,5]. Although using fixed structures can reduce
the number of invalid pipelines during the composition and optimisation, these
approaches limit the exploration of promising pipelines which may have a
variety of structures. Secondly, while evolutionary-algorithm-based methods [4]
enable the randomness of the pipelines’ structure using the concept of
evolution, this randomness tends to construct more invalid pipelines than valid ones.


© The Author(s) 2020

Besides, the search spaces of the pipelines’ structures and hyperparameters of the
pipelines’ components expand significantly. Therefore, the existing approaches 
tend to be inefficient as they often attempt to evaluate invalid pipelines. There 
are several attempts to reduce the randomness of pipeline construction by using 
context-free grammars [8,9] or AI planning to guide the construction of pipelines 
[6,7]. Nevertheless, all of these methods evaluate the validity of a pipeline by
executing it (T-method). After executing a pipeline, if the result is a predictive
model, the T-method evaluates the pipeline to be valid; otherwise it is invalid.
If a pipeline is complex, the complexity of preprocessing/predictor components
within the pipeline is high, or the size of the dataset is large, the evaluation of
the pipeline is expensive. Consequently, the optimisation will require a significant
time budget to find well-performing pipelines.

To address this issue, we propose the AVATAR to evaluate ML pipelines 
using their surrogate models. The AVATAR transforms a pipeline to its surro- 
gate model and evaluates it instead of executing the original pipeline. We use 
the business process model and notation (BPMN) [10] to represent ML pipelines. 
BPMN was invented for the purposes of a graphical representation of business 
processes, as well as a description of resources for process execution. In
addition, BPMN simplifies the understanding of business activities and
interpretation of behaviours of ML pipelines. The ML pipelines’ components use the Weka
libraries¹ for ML algorithms. The evaluation of the surrogate models requires a
knowledge base which is generated from many synthetic datasets. To this end, 
this paper has two main contributions: 


— We conduct experiments on current state-of-the-art AutoML tools to show 
that the construction of invalid pipelines during the pipeline composition and 
optimisation may lead to bad performance. 

— We propose the AVATAR to accelerate the automatic pipeline composition 
and optimisation by evaluating pipelines using a surrogate model. 


This paper is divided into five sections. After the Introduction, Sect. 2 reviews 
previous approaches to representing and evaluating ML pipelines in the context 
of AutoML. Section 3 presents the AVATAR to evaluate ML pipelines. Section 4 
presents experiments to motivate our research and prove the efficiency of the 
proposed method. Finally, Sect. 5 concludes this study. 


2 Related Work 


Salvador et al. [2] proposed an automatic pipeline composition and optimisation
method of multicomponent predictive systems (MCPS) to deal with the
problem of combined algorithm selection and hyperparameter optimisation (CASH).
This proposed method is implemented in the tool AutoWeka4MCPS [2]
developed on top of Auto-Weka 0.5 [11]. The pipelines, which are generated by
AutoWeka4MCPS, are represented using Petri nets [12]. A Petri net is a
mathematical modelling language used to represent pipelines [2] as well as data service
compositions [13]. The main idea of Petri nets is to represent transitions of
states of a system. Although it is not clearly mentioned in these previous works
[4-7], a directed acyclic graph (DAG) is often used to model sequential pipelines
in methods/tools such as AutoWeka4MCPS [14], ML-Plan [6], P4ML [7],
TPOT [4] and Auto-sklearn [5]. A DAG is a type of graph whose vertices are
connected by edges that have only one direction [15]. In
addition, a DAG does not allow any directed loop, which means that its vertices
can be topologically ordered. ML-Plan generates sequential workflows consisting of ML components.
Thus, the workflows are a type of DAG. The final output of P4ML is a pipeline
which is constructed by making an ensemble of other pipelines. Auto-sklearn
generates fixed-length sequential pipelines consisting of scikit-learn components.
TPOT constructs pipelines consisting of multiple preprocessing sub-pipelines.
The authors claim that the representation of the pipelines is a tree-based
structure. However, a tree-based structure always starts with a root node and ends
with many leaf nodes, but the output of a TPOT pipeline is a single predictive
model. Therefore, the representation of a TPOT pipeline is more like a DAG.
P4ML uses a tree-based structure to make a multi-layer ensemble. This
tree-based structure can be specialised into a DAG. The reason is that the execution
of these pipelines will start from leaf nodes and end at root nodes where the
construction of the ensembles is completed. It means that the control flows of
these pipelines have one direction, or they are topologically ordered. Using a
DAG to model an ML pipeline makes it easy for humans to understand, as DAGs
facilitate visualisation and interpretation of the control flow. However, DAGs
do not model inputs/outputs (i.e. possibly datasets, output predictive models,
parameters and hyperparameters of components) between vertices. Therefore,
the existing studies use ad-hoc approaches and make assumptions about data
inputs/outputs of the pipelines’ components.


1 https://www.cs.waikato.ac.nz/ml/weka/.

Although AutoWeka4MCPS, ML-Plan, P4ML, TPOT and Auto-sklearn
evaluate pipelines by executing them, these methods have strategies to limit the
generation of invalid pipelines. Auto-sklearn uses a fixed pipeline template
including preprocessing, predictor and ensemble components. AutoWeka4MCPS also
uses a fixed pipeline template consisting of six components. TPOT, ML-Plan
and P4ML use grammars/primitive catalogues, which are designed manually, to
guide the construction of pipelines. Although these approaches can reduce the
number of invalid pipelines, our experiments showed that the time wasted on
evaluating the invalid pipelines is significant. Moreover, using fixed templates,
grammars and primitive catalogues reduces the search space of potential pipelines,
which is a drawback during pipeline composition and optimisation.


3 Evaluation of ML Pipelines Using Surrogate Models 


Because the evaluation of ML pipelines is expensive in certain cases (i.e.,
complex pipelines, high-complexity pipeline components and large datasets) in the
context of AutoML, we propose the AVATAR² to speed up the process by
evaluating their surrogate pipelines. The main idea of the AVATAR is to expand
the purpose and representation of MCPS introduced in [12]. The AVATAR uses
a surrogate model in the form of a Petri net. This surrogate pipeline keeps
the structure of the original pipeline, replaces the datasets in the form of data
matrices (i.e., components’ input/output simplified mappings) by the matrices
of transformed-features, and the ML algorithms by transition functions to
calculate the output from the input tokens (i.e., the matrices of transformed-features).
Because of the simplicity of the surrogate pipelines in terms of the size of the
tokens and the simplicity of the transition functions, the evaluation of these
pipelines is substantially less expensive than the original ones.


3.1 The AVATAR Knowledge Base 


We define transformed-features as the features that represent a dataset’s
characteristics. These characteristics can change as the dataset is transformed
by ML algorithms. Table 1 describes the transformed-features used
for the knowledge base. We select these transformed-features because the
capabilities of an ML algorithm to work with a dataset depend on them.
These transformed-features are extended from the capabilities of Weka
algorithms³.


Table 1. Descriptions of the transformed-features of a dataset.


Transformed-feature      | Description
BINARY_CLASS             | A dataset has binary classes
NUMERIC_CLASS            | A dataset has numeric classes
DATE_CLASS               | A dataset has date classes
MISSING_CLASS_VALUES     | A dataset has missing values in classes
NOMINAL_CLASS            | A dataset has nominal classes
SYMBOLIC_CLASS           | A dataset has symbolic data in classes
STRING_CLASS             | A dataset has string classes
UNARY_CLASS              | A dataset has unary classes
BINARY_ATTRIBUTES        | A dataset has binary attributes
DATE_ATTRIBUTES          | A dataset has date attributes
EMPTY_NOMINAL_ATTRIBUTES | A dataset has an empty column
MISSING_VALUES           | A dataset has missing values in attributes
NOMINAL_ATTRIBUTES       | A dataset has nominal attributes
NUMERIC_ATTRIBUTES       | A dataset has numeric attributes
UNARY_ATTRIBUTES         | A dataset has unary attributes
PREDICTIVE_MODEL         | A predictive model generated by a predictor


2 https://github.com/UTS-AAi/AVATAR.


356 T.-D. Nguyen et al.

The purpose of the AVATAR knowledge base is to describe the logic of the
transition functions of the surrogate pipelines. The logic includes the capabilities
and effects of ML algorithms (i.e., pipeline components).

The capabilities are used to verify whether an algorithm is compatible
with a dataset. For example, can the linear regression algorithm
work with missing values and numeric attributes? The capabilities have
a list of transformed-features. The value of each capability-related
transformed-feature is either 0 (i.e., the algorithm cannot work with a dataset which
has this transformed-feature) or 1 (i.e., the algorithm can work with a dataset
which has this transformed-feature). Based on the capabilities, we can determine
which components of a pipeline (i.e., ML algorithms) are not able to process
specific transformed-features of a dataset.

The effects describe data transformations. Similar to the capabilities, the 
effects have a list of transformed-features. Each effect-related transformed- 
feature can have three values, 0 (i.e., do not transform this transformed-feature), 
1 (i.e., transform one or more attributes/classes to this transformed-feature), 
or -1 (i.e., disable the effect of this transformed-feature on one or more
attributes/classes). 

To generate the AVATAR knowledge base⁴, we use synthetic datasets⁵
in which the number of active transformed-features is minimised, so that we can evaluate
which transformed-features impact the capabilities and effects of
ML algorithms⁶, and how. Real-world datasets usually have many active
transformed-features, which makes them unsuitable for our purpose. We minimise the number
of available transformed-features in each synthetic dataset so that the knowledge
base can be applicable to a variety of pipelines and datasets. Figure 1 presents
the algorithm to generate the AVATAR knowledge base. This algorithm has four
main stages:


1. Initialisation: The first stage initialises all transformed-features in the
   capabilities and effects to 0.

2. Execution: Run ML algorithms with every synthetic dataset and get outputs
   (i.e., output datasets or predictive models).

3. Find capabilities: If the execution is successful, we set to 1 the
   capability-related transformed-features that are active in the input dataset.

4. Find effects: If an algorithm is a predictor/transformed-predictor, we set
   PREDICTIVE_MODEL for its effects. If the algorithm is a filter and the
   current value of an effect-related transformed-feature is the default value (0),
   we set this effect-related transformed-feature equal to the difference between
   the values of this transformed-feature in the output and input datasets.


3 http://weka.sourceforge.net/doc.dev/weka/core/Capabilities.html.

4 https://github.com/UTS-AAi/AVATAR/blob/master/avatar-knowledge-base/avatar_knowledge_base.json.

5 https://github.com/UTS-AAi/AVATAR/tree/master/synthetic-datasets.

6 https://github.com/UTS-AAi/AVATAR/blob/master/supplementary-documents/avatar_algorithms.txt.


[Flowchart: after initialisation, for each synthetic dataset and each ML algorithm, run the algorithm with the dataset; if the execution is successful, set f_cap_i = 1 wherever f_input_i = 1; for a predictor, set PREDICTIVE_MODEL = 1 in the effects; for a filter, set f_effect_i = f_output_i - f_input_i wherever f_effect_i = 0; finally, store the capabilities and effects in the AVATAR knowledge base. Here f_input_i and f_output_i denote transformed-feature i of the input and output dataset, and f_cap_i and f_effect_i denote transformed-feature i in the capabilities and effects.]

Fig. 1. Algorithm to generate the knowledge base for evaluating surrogate pipelines.
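
The four stages can be sketched in Python as follows. This is a simplified illustration, not the AVATAR implementation: each synthetic dataset is represented directly by its transformed-feature vector (a dict of 0/1 values), each algorithm is a hypothetical callable that returns the transformed-features of its output or raises an exception when execution fails, and only four of the transformed-features of Table 1 are used.

```python
FEATURES = ["MISSING_VALUES", "NUMERIC_ATTRIBUTES", "NOMINAL_ATTRIBUTES", "PREDICTIVE_MODEL"]


def build_knowledge_base(algorithms, datasets, features=FEATURES):
    """algorithms: name -> (kind, run), with kind in {"filter", "predictor"}."""
    # Stage 1: initialise every capability and effect to 0.
    kb = {name: {"capabilities": dict.fromkeys(features, 0),
                 "effects": dict.fromkeys(features, 0)}
          for name in algorithms}
    for f_input in datasets:
        for name, (kind, run) in algorithms.items():
            # Stage 2: execute the algorithm on the synthetic dataset.
            try:
                f_output = run(f_input)
            except Exception:
                continue  # failed execution: nothing is recorded
            caps, effects = kb[name]["capabilities"], kb[name]["effects"]
            # Stage 3: active input features become supported capabilities.
            for f in features:
                if f_input.get(f, 0) == 1:
                    caps[f] = 1
            # Stage 4: record the effects.
            if kind == "predictor":
                effects["PREDICTIVE_MODEL"] = 1
            else:
                for f in features:
                    if effects[f] == 0:  # still at its default value
                        effects[f] = f_output.get(f, 0) - f_input.get(f, 0)
    return kb


# Hypothetical toy components, used only to exercise the sketch.
def impute_missing(f_input):
    return {**f_input, "MISSING_VALUES": 0}


def toy_regressor(f_input):
    if f_input.get("MISSING_VALUES") or f_input.get("NOMINAL_ATTRIBUTES"):
        raise ValueError("unsupported input")
    return {"PREDICTIVE_MODEL": 1}


datasets = [{"NUMERIC_ATTRIBUTES": 1},
            {"NUMERIC_ATTRIBUTES": 1, "MISSING_VALUES": 1},
            {"NOMINAL_ATTRIBUTES": 1}]
kb = build_knowledge_base({"impute_missing": ("filter", impute_missing),
                           "toy_regressor": ("predictor", toy_regressor)}, datasets)
print(kb["impute_missing"]["effects"])
```

In this toy run, the filter learns an effect of -1 on MISSING_VALUES, while the regressor only acquires the NUMERIC_ATTRIBUTES capability because its executions on the other two datasets fail.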


3.2 Evaluation of ML Pipelines 


The AVATAR evaluates an ML pipeline by mapping it to its surrogate pipeline
and evaluating this surrogate pipeline. BPMN is the most promising method to
represent an ML pipeline. The reasons are that a BPMN-based ML pipeline can
be executable, has a better interpretation of the pipeline in terms of control,
data flows and resources for execution, and integrates into existing
business processes as a subprocess. Moreover, we claim that a Petri net is the most
promising method to represent a surrogate pipeline. The reason is that it is fast
to verify the validity of a Petri-net-based simplified ML pipeline.


[Figure: a BPMN pipeline (example component: weka.classifiers.bayes.NaiveBayes) and its Petri net surrogate]

Fig. 2. Mapping a ML pipeline to its surrogate model.


Mapping an ML Pipeline to Its Surrogate Model. The AVATAR maps a
BPMN pipeline to a Petri net pipeline via three stages (Fig. 2).


1. The structure of the BPMN-based ML pipeline is mapped to the respective 
structure of the Petri net surrogate pipeline. The start and end events are 
mapped to the start and end places respectively. The components are mapped 
to empty transitions. Empty places are put between all transitions. Finally, 
all flows are mapped to arcs. 

2. The values of transformed-features are calculated from the input dataset to 
form a transformed-feature matrix which is the input token in the start place 
of the surrogate pipeline. 

3. The transition functions are mapped from the components. In this stage, only 
the corresponding algorithm information is mapped to the transition function. 
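
The structural part of this mapping (stage 1) can be sketched for the simple case of a sequential pipeline. The `PetriNet` container, the place names and the restriction to sequential pipelines are assumptions made for illustration; a BPMN pipeline is reduced here to the ordered list of its component names.

```python
from dataclasses import dataclass


@dataclass
class PetriNet:
    places: list        # start place, intermediate places, end place
    transitions: list   # one transition per pipeline component
    arcs: list          # (source, target) pairs


def map_sequential_pipeline(components):
    """Map an ordered list of component names to a Petri net structure."""
    # Start/end events become start/end places; an empty place sits
    # between every pair of consecutive transitions.
    places = ["start"] + [f"p{i}" for i in range(1, len(components))] + ["end"]
    transitions = list(components)
    arcs = []
    for i, transition in enumerate(transitions):
        arcs.append((places[i], transition))       # flow into the component
        arcs.append((transition, places[i + 1]))   # flow out of the component
    return PetriNet(places, transitions, arcs)


pipeline = ["weka.filters.unsupervised.attribute.ReplaceMissingValues",
            "weka.classifiers.bayes.NaiveBayes"]
net = map_sequential_pipeline(pipeline)
print(net.places, len(net.arcs))
```

Stages 2 and 3 then attach the transformed-feature token to the start place and the capabilities and effects of each algorithm to its transition.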


[Flowchart: get all transformed-features stored in the input token; for each f_in_token_i, get the respective f_cap_i from the AVATAR knowledge base; if f_in_token_i = 1 and f_cap_i = 0, the component is invalid; otherwise, for each f_in_token_i, get the respective f_effect_i from the AVATAR knowledge base and set f_out_token_i = f_in_token_i + f_effect_i. Here f_in_token_i and f_out_token_i denote transformed-feature i stored in the input and output token, and f_cap_i and f_effect_i denote transformed-feature i in the capabilities and effects.]

Fig. 3. Algorithm for firing a transition of the surrogate model.


Evaluating a Surrogate Model. The evaluation of a surrogate model
executes a Petri net pipeline. This execution proceeds by firing each transition of
the Petri net pipeline in turn and transforming the input token. As shown in Fig. 3,
firing a transition consists of two tasks: (i) the evaluation of the capabilities
of each component; and (ii) the calculation of the output token. The first task
verifies the validity of the component using the following rules. If the value of a
transformed-feature stored in the input token (f_in_token_i) is 1 and the
corresponding transformed-feature in the component’s capabilities (f_cap_i) is 0, this
component is invalid. Otherwise, this component is valid. If a component
is invalid, the surrogate pipeline is evaluated as invalid. The second task
calculates each transformed-feature stored in the output token (f_out_token_i) in the
next place from the input token by adding the value of a transformed-feature
stored in the input token (f_in_token_i) and the respective transformed-feature
in the component’s effects (f_effect_i).
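
The two tasks of firing a transition, and the resulting validity check of a sequential surrogate pipeline, can be sketched as follows. This is a minimal illustration: tokens, capabilities and effects are plain dicts, the knowledge base entries are hand-written toy values rather than entries of the actual AVATAR knowledge base, and the component names are hypothetical.

```python
def fire_transition(in_token, capabilities, effects):
    """Return the output token, or None if the component is invalid."""
    # Task (i): an active input feature that the component cannot handle
    # makes the component invalid.
    for feature, value in in_token.items():
        if value == 1 and capabilities.get(feature, 0) == 0:
            return None
    # Task (ii): f_out_token_i = f_in_token_i + f_effect_i.
    return {feature: value + effects.get(feature, 0)
            for feature, value in in_token.items()}


def is_valid_pipeline(token, components, knowledge_base):
    """Fire the transitions of a sequential surrogate pipeline in order."""
    for name in components:
        entry = knowledge_base[name]
        token = fire_transition(token, entry["capabilities"], entry["effects"])
        if token is None:
            return False
    return True


# Toy knowledge base (illustrative values only).
KB = {
    "impute_missing": {
        "capabilities": {"MISSING_VALUES": 1, "NUMERIC_ATTRIBUTES": 1},
        "effects": {"MISSING_VALUES": -1},
    },
    "toy_regressor": {
        "capabilities": {"NUMERIC_ATTRIBUTES": 1},
        "effects": {"PREDICTIVE_MODEL": 1},
    },
}
token = {"MISSING_VALUES": 1, "NUMERIC_ATTRIBUTES": 1, "PREDICTIVE_MODEL": 0}
print(is_valid_pipeline(token, ["impute_missing", "toy_regressor"], KB))
print(is_valid_pipeline(token, ["toy_regressor"], KB))
```

With the toy values, the pipeline that imputes missing values before regression is evaluated as valid, whereas the regressor alone is rejected because the input token has MISSING_VALUES = 1 and the regressor's capability for that feature is 0. No algorithm is executed on real data in either case.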


4 Experiments 


To investigate the impact of invalid pipelines on ML pipeline composition and
optimisation, we first conducted a series of experiments with current
state-of-the-art AutoML tools. After that, we conducted experiments to
compare the performance of the AVATAR and the existing methods.


4.1 Experimental Settings 


Table 2 summarises the characteristics of the datasets⁷ used for the experiments. We use
these datasets because they were used in previous studies [2,4,5]. The AutoML
tools used for the experiments are AutoWeka4MCPS [2] and Auto-sklearn [5].
These tools are selected because their abilities to construct and optimise
hyperparameters of complex ML pipelines have been empirically proven to be effective
in a number of previous studies [2,5,16]. However, these previous experiments
did not investigate the negative impact of the evaluation of invalid pipelines
on the quality of the pipeline composition and optimisation. This is the goal
of our first set of experiments. In the second set of experiments, we show that
the AVATAR can significantly reduce the evaluation time of ML pipelines.


Table 2. Summary of the datasets’ characteristics: the number of numeric attributes,
nominal attributes, distinct classes, and instances in the training and testing sets.


Dataset | Numeric | Nominal | No. of distinct classes | Training | Testing
abalone | 7       | 1       | 26                      | 2,924    | 1,253
car     | 0       | 6       | 4                       | 1,210    | 518
convex  | 784     | 0       | 2                       | 8,000    | 50,000
gcredit | 7       | 13      | 2                       | 700      | 300
wineqw  | 11      | 0       | 7                       | 3,429    | 1,469


7 https://archive.ics.uci.edu.


4.2 Experiments to Investigate the Impact of Invalid Pipelines 


To investigate the impact of invalid pipelines, we use five iterations (Iter) for the
first set of experiments. We run these experiments on AWS EC2 t3a.small virtual
machines, which have 2 vCPUs and 2 GB of memory. Each iteration uses a different
seed number. We set the time budget to 1 h and the memory to 1 GB. We evaluate
the pipelines produced by the AutoML tools using three criteria: (1) the number
of invalid/valid pipelines, (2) the total evaluation time of invalid/valid pipelines
(seconds), and (3) the wasted evaluation time (%). The wasted evaluation time
is calculated as the percentage of the total evaluation time of invalid pipelines
over the total runtime of the pipeline composition and optimisation. The wasted
evaluation time represents the degree of the negative impact of invalid pipelines.
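
As a small worked example of criterion (3), the following sketch computes the wasted evaluation time for one run. The function name is hypothetical, and taking the total runtime as the sum of the invalid and valid evaluation times is an assumption; it happens to reproduce the value reported for abalone, Iter 1 in Table 3, but it need not hold for every tool.

```python
def wasted_evaluation_time(invalid_time_s, total_runtime_s):
    """Percentage of the total runtime spent on evaluating invalid pipelines."""
    if total_runtime_s <= 0:
        raise ValueError("total runtime must be positive")
    return 100.0 * invalid_time_s / total_runtime_s


# abalone, Iter 1 (Table 3): 3607.7 s on invalid and 1322.5 s on valid pipelines.
# Assumption: total runtime = invalid time + valid time.
invalid_s, valid_s = 3607.7, 1322.5
wasted = wasted_evaluation_time(invalid_s, invalid_s + valid_s)
print(f"{wasted:.2f}%")  # 73.18%
```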
Tables 3 and 4 present the negative impacts of invalid pipelines in the ML pipeline
composition and optimisation of AutoWeka4MCPS and Auto-sklearn using the
above criteria. These tables show that not all of the constructed pipelines are valid.
Because AutoWeka4MCPS can compose pipelines which have up to six components,
it is more likely to generate invalid pipelines and the evaluation time


Table 3. Negative impacts of invalid pipelines in pipeline composition and optimisation
of AutoWeka4MCPS. (1): the number of invalid/valid pipelines, (2): the total
evaluation time of invalid/valid pipelines (s), (3): the wasted evaluation time (%).


Dataset | Criteria | Iter 1        | Iter 2        | Iter 3        | Iter 4        | Iter 5
abalone | (1)      | 16/26         | 90/79         | 69/88         | 34/29         | 53/80
        | (2)      | 3607.7/1322.5 | 2007.1/1236.4 | 4512.9/2172.3 | 3615.4/277.6  | 23.2/3509.0
        | (3)      | 73.18         | 61.88         | 67.51         | 92.87         | 0.66
car     | (1)      | 205/152       | 108/70        | 197/313       | 139/156       | 85/64
        | (2)      | 3818.1/291.8  | 3498.5/113.0  | 4523.6/532.6  | 5232.2/251.3  | 4365.1/90.1
        | (3)      | 92.90         | 96.87         | 89.47         | 95.42         | 97.98
convex  | (1)      | 18/20         | 2/0           | 17/11         | crashed       | crashed
        | (2)      | 76.3/3588.1   | 3475.2/0.0    | 1324.7/2331.8 | -             | -
        | (3)      | 2.08          | 100.00        | 36.23         | -             | -
gcredit | (1)      | 112/195       | 229/364       | 208/166       | 12/54         | 30/54
        | (2)      | 2821.0/2260.1 | 3829.8/285.6  | 3933.8/184.0  | 3667.6/34.1   | 3634.8/64.7
        | (3)      | 55.52         | 93.06         | 95.53         | 99.08         | 98.25
wineqw  | (1)      | 203/213       | 121/139       | crashed       | 201/302       | 36/54
        | (2)      | 4880.6/1052.9 | 4183.4/1078.6 | -             | 2418.5/1132.2 | 1639.2/862.2
        | (3)      | 82.26         | 79.50         | -             | 68.11         | 65.53


Table 4. Negative impacts of invalid pipelines in pipeline composition and optimisation
of Auto-sklearn. (1): the number of invalid/valid pipelines, (2): the total evaluation time
of invalid/valid pipelines (s), (3): the wasted evaluation time (%).


Dataset | Criteria | Iter 1       | Iter 2      | Iter 3       | Iter 4      | Iter 5
abalone |          | crashed      | crashed     | crashed      | crashed     | crashed
car     |          | crashed      | crashed     | crashed      | crashed     | crashed
convex  | (1)      | 2/13         | 2/6         | 2/8          | 2/6         | 2/8
        | (2)      | 560.8/2981.8 | 537.7/629.2 | 584.1/1537.5 | 558.1/977.1 | 560.0/1655.9
        | (3)      | 15.76        | 15.07       | 16.39        | 15.66       | 15.72
gcredit |          | crashed      | crashed     | crashed      | crashed     | crashed
wineqw  | (1)      | 0/42         | 0/22        | 0/42         | 0/32        | 0/32
        | (2)      | 0.0/3523.4   | 0.0/909.7   | 0.0/3197.4   | 0.0/3054.0  | 0.0/3163.5
        | (3)      | 0.00         | 0.00        | 0.00         | 0.00        | 0.00


of these invalid pipelines is significant. For example, the wasted evaluation
time is 97.98% in the case of the dataset car and Iter 5. We can see that
changing the random seed between iterations has a strong impact on the wasted
evaluation time in the case of AutoWeka4MCPS. For example, the experiments
with the dataset abalone show that the wasted evaluation time ranges
from 0.66% to 92.87%. The reason is that the Weka libraries themselves can evaluate
the compatibility of a single-component pipeline without execution. If the
initialisation of the pipeline composition and optimisation with a specific seed
number results in pipelines consisting of only one predictor, and these pipelines
are well-performing, the optimisation tends to exploit similar ML pipelines. As a result, the
wasted evaluation time is low. However, this impact is negligible in the case of
Auto-sklearn. The reason is that Auto-sklearn uses meta-learning to initialise
with promising ML pipelines. The experiments with the datasets abalone, car
and gcredit show that Auto-sklearn limits the generation of invalid pipelines by
making assumptions about cleaned input datasets, because the experiments crash
if the input datasets have multiple attribute types. This means that Auto-sklearn
cannot handle invalid pipelines effectively.


4.3 Experiments to Compare the Performance of AVATAR and the 
Existing Methods 


In order to demonstrate the efficiency of the AVATAR, we have conducted a
second set of experiments. We run these experiments on a machine with an
Intel Core i7-8650U CPU and 16 GB of memory. We compare the performance of
the AVATAR and the T-method, which requires the execution of pipelines. The
T-method is used to evaluate the validity of pipelines in the pipeline composition
and optimisation of AutoWeka4MCPS and Auto-sklearn. We randomly
generate ML pipelines which have up to six components (i.e., these component
types are missing value handling, dimensionality reduction, outlier removal, data
transformation, data sampling and predictor). The predictor is put at the end


Table 5. Comparison of the performance of the AVATAR and the T-method


Method   | Criterion                                               | abalone           | car               | convex           | gcredit           | winequality
T-method | Invalid/valid pipelines                                 | 683/1,097         | 4,387/6,817       | 252/428          | 4,557/7,208       | 1,276/1,951
T-method | Total evaluation time of invalid/valid pipelines (s)    | 27,711.9/15,484.1 | 18,627.9/24,459.4 | 5,818.3/37,765.1 | 19,597.9/23,452.5 | 10,830.1/32,326.9
AVATAR   | Invalid/valid pipelines                                 | 663/1,117         | 4,387/6,817       | 250/430          | 4,552/7,213       | 1,262/1,965
AVATAR   | Total evaluation time of invalid/valid pipelines (s)    | 3.5/4.9           | 43.1/64.8         | 19.6/131.1       | 57.0/89.2         | 17.1/25.4
Both     | Pipelines with different/similar evaluated results      | 20/1,760          | 0/11,204          | 2/678            | 5/11,760          | 14/3,213
Both     | Pipelines that the AVATAR can validate accurately (%)   | 98.88             | 100.00            | 99.71            | 99.96             | 99.57


of the pipelines because a valid pipeline always has a predictor at the end. Each
pipeline is evaluated by both the AVATAR and the T-method. We set the time budget
to 12 h per dataset. We use the following criteria to compare the performance:
the number of invalid/valid pipelines, the total evaluation time of invalid/valid
pipelines (seconds), the number of pipelines that have different/similar evaluated results
between the AVATAR and the T-method, and the percentage of the pipelines
that the AVATAR can validate accurately (%) in comparison to the T-method.
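
The last criterion can be reproduced from the different/similar counts reported in Table 5. The sketch below is a plain arithmetic check; the function name and the dictionary layout are hypothetical.

```python
def validation_accuracy(n_different, n_similar):
    """Percentage of pipelines on which the AVATAR agrees with the T-method."""
    total = n_different + n_similar
    if total == 0:
        raise ValueError("no pipelines were evaluated")
    return 100.0 * n_similar / total


# (different, similar) counts per dataset, taken from Table 5.
agreement = {
    "abalone": (20, 1760),
    "car": (0, 11204),
    "convex": (2, 678),
    "gcredit": (5, 11760),
    "winequality": (14, 3213),
}
accuracy = {name: round(validation_accuracy(d, s), 2)
            for name, (d, s) in agreement.items()}
print(accuracy)
```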

Table 5 compares the performance of the AVATAR and the T-method using
the above criteria. We can see that the total evaluation time of invalid/valid
pipelines of the AVATAR is significantly lower than that of the T-method. While the
evaluation time of pipelines of the AVATAR is quite stable, the evaluation time
of pipelines of the T-method is much higher and depends on the size of the
datasets. This means that the AVATAR is faster than the T-method in evaluating
both invalid and valid pipelines regardless of the size of the datasets. Moreover, we
can see that the accuracy of the AVATAR is approximately 99% in comparison
with the T-method. We have carefully reviewed the pipelines which have different
evaluated results between the AVATAR and the T-method. Interestingly, the
AVATAR evaluates all of these pipelines as valid, whereas the T-method evaluates
them as invalid. The reason is that the executions of these pipelines cause
out-of-memory errors. In other words, the AVATAR does not consider the allocated
memory as a factor in the validity of a pipeline. A promising solution is
to reduce the size of an input dataset by adding a sampling component with
appropriate hyperparameters. If the sampling size is too small, we may miss
important features. If the sampling size is too large, we may continue to run into
out-of-memory problems. We cannot conclude whether the executions of these
pipelines would succeed if more memory were allocated. This shows
that the validity of a pipeline also depends on its execution environment, such as
the available memory. These factors have not been considered yet in the AVATAR. This is an
interesting research gap that should be addressed in the future.


Table 6. Five invalid pipelines with the longest evaluation time using the T-method 
on the gcredit dataset. 


Pipeline     | #1     | #2     | #3     | #4     | #5
T-method (s) | 11.092 | 11.068 | 11.067 | 11.067 | 11.066
AVATAR (s)   |  0.014 |  0.012 |  0.011 |  0.011 |  0.011


Finally, we take a detailed look at the invalid pipelines with the longest evaluation
time using the T-method on the gcredit dataset, as shown in Table 6.
Pipeline #1 (11.092 s) has the structure ReplaceMissingValues → PeriodicSampling
→ NumericToNominal → PrincipalComponents → SMOreg. This pipeline
is invalid because SMOreg does not work with nominal classes, and there is
no component transforming the nominal data to numeric data. We can see that the
AVATAR is able to evaluate the validity of this pipeline without executing it in
just 0.014 s.


5 Conclusion 


We empirically demonstrate the problem of generation of invalid pipelines dur- 
ing pipeline composition and optimisation. We propose the AVATAR which is a 
pipeline evaluation method using a surrogate model. The AVATAR can be used 
to accelerate pipeline composition and optimisation methods by quickly ignor- 
ing invalid pipelines to improve the effectiveness of the AutoML optimisation 
process. In the future, we will improve the AVATAR to evaluate the quality of pipelines
in addition to their validity. Moreover, we will investigate how to employ the AVATAR
to reduce search spaces dynamically.


Abstract. This paper presents a new approach to the detection of discontinuities
in the n-th derivative of observational data. This is achieved
by performing two polynomial approximations at each interstitial point. 
The polynomials are coupled by constraining their coefficients to ensure 
continuity of the model up to the (n − 1)-th derivative, while yielding an
estimate for the discontinuity of the n-th derivative. The coefficients of 
the polynomials correspond directly to the derivatives of the approxima- 
tions at the interstitial points through the prudent selection of a common 
coordinate system. The approximation residual and extrapolation errors 
are investigated as measures for detecting discontinuity. This is neces- 
sary since discrete observations of continuous systems are discontinuous 
at every point. It is proven, using matrix algebra, that positive extrema 
in the combined approximation-extrapolation error correspond exactly 
to extrema in the difference of the Taylor coefficients. This provides a 
relative measure for the severity of the discontinuity in the observational 
data. The matrix algebraic derivations are provided for all aspects of 
the methods presented here; this includes a solution for the covariance 
propagation through the computation. The performance of the method 
is verified with a Monte Carlo simulation using synthetic piecewise poly- 
nomial data with known discontinuities. It is also demonstrated that the 
discontinuities are suitable as knots for B-spline modelling of data. For 
completeness, the results of applying the method to sensor data acquired 
during the monitoring of heavy machinery are presented. 


Keywords: Data analysis · Discontinuity detection · Free-knot splines


1 Introduction 


In the recent past, physics-informed data science has become a focus of research
activities, e.g., [9]. It appears under different names, e.g., physics informed [12],
hybrid learning [13], physics-based [17], etc., but with the same basic idea of
embedding physical principles into the data science algorithms. The goal is to
ensure that the results obtained obey the laws of physics and/or are based on
physically relevant features. Discontinuities in the observations of continuous
systems violate some very basic physics and for this reason their detection is of
fundamental importance. Consider Newton's second law of motion,


$$ F(t) = \frac{d}{dt}\left\{ m(t)\,\dot{y}(t) \right\} = \dot{m}(t)\,\dot{y}(t) + m(t)\,\ddot{y}(t). \qquad (1) $$


Any discontinuities in the observations of $m(t)$, $\dot{m}(t)$, $y(t)$, $\dot{y}(t)$ or $\ddot{y}(t)$ indicate
a violation of some basic principle: be it that the observation is incorrect
or something unexpected is happening in the system. Consequently, detecting
discontinuities is of fundamental importance in physics-based data science. A
function $s(x)$ is said to be $C^n$ discontinuous if $s \in C^{n-1} \setminus C^n$, that is, if $s(x)$ has
continuous derivatives up to and including order $n - 1$, but the $n$-th derivative
is discontinuous. Due to the discrete and finite nature of the observational data,
only jump discontinuities in the $n$-th derivative are considered; asymptotic
discontinuities are not considered. Furthermore, in more classical data modelling,
$C^n$ jump discontinuities form the basis for the locations of knots in B-spline
models of observational data [15].


1.1 State of the Art 


There are numerous approaches in the literature dealing with estimating regres- 
sion functions that are smooth, except at a finite number of points. Based on the 
methods, these approaches can be classified into four groups: local polynomial 
methods, spline-based methods, kernel-based methods and wavelet methods. The 
approaches vary also with respect to the available a priori knowledge about the 
number of points of discontinuity or the derivative in which these discontinuities 
appear. For a good literature review of these methods, see [3]. The method used 
in this paper is relevant both in terms of local polynomials as well as spline-based 
methods; however, the new approach requires no a priori knowledge about the 
data. 

In the local polynomial literature, namely in [8] and [14], ideas similar to the
ones presented here are investigated. In these papers, local polynomial approximations
from the left and the right side of the point in question are used. The
major difference is that neither of these methods uses constraints to ensure that
the local polynomial approximations enforce continuity of the lower derivatives,
which is done in this paper. As such, they use different residuals to determine the
existence of a change point. Using constrained approximation ensures that the
underlying physical properties of the system are taken into consideration, which
is one of the main advantages of the approach presented here. Additionally, in the
aforementioned papers, it is not clear whether only co-locative points are considered
as possible change points, or whether interstitial points are also considered. This
distinction between co-locative and interstitial points is of great importance. Fundamentally,
the method presented here can be applied to discontinuities at either
location. However, it has been assumed that discontinuities only make sense
between the sampled (co-locative) points, i.e., the discontinuities are interstitial.

In [11], on the other hand, one polynomial instead of two is used, and the focus
is mainly on detecting $C^0$ and $C^1$ discontinuities. Additionally, the number of
change points must be known a priori, so only their location is approximated;
the required a priori knowledge makes the method unsuitable for real sensor-based
system observation.

In the spline-based literature there are heuristic methods (top-down and 
bottom-up) as well as optimization methods. For a more detailed state of the 
art on splines, see [2]. Most heuristic methods use a discrete geometric measure,
such as discrete curvature or kink angle, to calculate whether a point is a knot,
and then use some (mostly arbitrary) threshold to improve the initial knot
set. In the method presented here, which falls under the category of bottom-up
approaches, the selection criterion is based on calculus and statistics. This
allows the fundamental physical laws governing the system to be incorporated
into the model, while also ensuring mathematical relevance and rigour.


1.2 The New Approach 


This paper presents a new approach to detecting $C^n$ discontinuities in observational
data. It uses constrained coupled polynomial approximation to obtain
two estimates for the $n$-th Taylor coefficients and their uncertainties, at every
interstitial point. This corresponds to approximating the local function by polynomials,
once from the left $f(x, \alpha)$ and once from the right $g(x, \beta)$. The constraints
couple the polynomials to ensure that $\alpha_i = \beta_i$ for every $i \in [0 \ldots n-1]$. In this
manner the approximations are $C^{n-1}$ continuous at the interstitial points, while
delivering an estimate for the difference in the $n$-th Taylor coefficients. All the
derivations for the coupled constrained approximations and the numerical implementations
are presented. Both the approximation and extrapolation residuals
are derived. It is proven that the discontinuities must lie at local positive peaks
in the extrapolation error. The new approach is verified both with known synthetic
data and with real sensor data obtained from observing the operation of
heavy machinery.


2 Detecting $C^n$ Discontinuities


Discrete observations $s(x_i)$ of a continuous system $s(x)$ are, by their very nature,
discontinuous at every sample. Consequently, some measure for discontinuity will
be required, with uncertainty, which provides the basis for further analysis.
The observations are considered to be the co-locative points, denoted by $x_i$
and collectively by the vector $x$; however, we wish to estimate the discontinuity
at the interstitial points, denoted by $\zeta_i$ and collectively as $\zeta$. Using interstitial
points, one ensures that each data point is used for only one polynomial
approximation at a time. Furthermore, in the case of sensor data, one expects
the discontinuities to happen between samples. Consequently the data is segmented
at the interstitial points, i.e. between the samples. This requires the use
of interpolating functions and in this work we have chosen to use polynomials.
Polynomials have been chosen because of their approximating, interpolating
and extrapolating properties when modelling continuous systems: the Weierstrass
approximation theorem [16] states that if $f(x)$ is a continuous real-valued
function defined on the real interval $x \in [a, b]$, then for every $\varepsilon > 0$, there
exists a polynomial $p(x)$ such that for all $x \in [a, b]$, the supremum norm
$\| f(x) - p(x) \|_\infty < \varepsilon$. That is, any continuous function $f(x)$ can be approximated by a
polynomial to an arbitrary accuracy $\varepsilon$ given a sufficiently high degree.

The basic concept (see Fig. 1) to detect a $C^n$ discontinuity is: to approximate
the data to the left of an interstitial point by the polynomial $f(x, \alpha)$ of degree $d_L$
and to the right by $g(x, \beta)$ of degree $d_R$, while constraining these approximations
to be $C^{n-1}$ continuous at the interstitial point. This approximation ensures that,


$$ f^{(k-1)}(\zeta_i) = g^{(k-1)}(\zeta_i), \quad \text{for every } k \in [1 \ldots n], \qquad (2) $$


while yielding estimates for $f^{(n)}(\zeta_i)$ and $g^{(n)}(\zeta_i)$ together with estimates for their
variances $\Lambda_{f(\zeta_i)}$ and $\Lambda_{g(\zeta_i)}$. This corresponds exactly to estimating the Taylor
coefficients of the function twice for each interstitial point, i.e., once from the
left and once from the right. If they differ significantly, then the function's $n$-th
derivative is discontinuous at this point. The Taylor series of a function $f(x)$
around the point $a$ is defined as,


Fig. 1. Schematic of a finite set of discrete observations (dotted circles) of a continuous
function. The span of the observation is split into a left and right portion at the
interstitial point (circle), with lengths $l_L$ and $l_R$ respectively. The left and right sides
are considered to be the functions $f(x)$ and $g(x)$, modelled by the polynomials $f(x, \alpha)$
and $g(x, \beta)$ of degrees $d_L$ and $d_R$.


$$ f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}\,(x - a)^k \qquad (3) $$


for each $x$ for which the infinite series on the right hand side converges. Furthermore,
any function which is $n + 1$ times differentiable can be written as


$$ f(x) = \tilde{f}(x) + R(x) \qquad (4) $$


where $\tilde{f}(x)$ is an $n$-th degree polynomial approximation of the function $f(x)$,


$$ \tilde{f}(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}\,(x - a)^k \qquad (5) $$


and $R(x)$ is the remainder term. The Lagrange form of the remainder $R(x)$ is
given by


$$ R(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x - a)^{n+1} \qquad (6) $$


where $\xi$ is a real number between $a$ and $x$.
A Taylor expansion around the origin (i.e. $a = 0$ in Eq. 3) is called a Maclaurin
expansion; for more details, see [1]. In the rest of this work, the $n$-th Maclaurin
coefficient for the function $f(x)$ will be denoted by


$$ t_f^{(n)} \triangleq \frac{f^{(n)}(0)}{n!}. \qquad (7) $$


The coefficients of a polynomial $f(x, \alpha) = \alpha_n x^n + \ldots + \alpha_1 x + \alpha_0$ are closely
related to the coefficients of the Maclaurin expansion of this polynomial. Namely,
it is easy to prove that


$$ \alpha_k = t_f^{(k)}, \quad \text{for every } k \in [0 \ldots n]. \qquad (8) $$


A prudent selection of a common local coordinate system, setting the interstitial
point as the origin, ensures that the coefficients of the left and right approximating
polynomials correspond to the derivative values at this interstitial point.
Namely, one gets a very clear relationship between the coefficients of the left and
right polynomial approximations, $\alpha$ and $\beta$, their Maclaurin coefficients, $t_f^{(n)}$ and
$t_g^{(n)}$, and the values of the derivatives at the interstitial point


$$ t_f^{(n)} = \frac{f^{(n)}(0)}{n!} = \alpha_n \quad \text{and} \quad t_g^{(n)} = \frac{g^{(n)}(0)}{n!} = \beta_n. \qquad (9) $$


From Eq. 9 it is clear that performing a left and right polynomial approximation
at an interstitial point is sufficient to get the derivative values at that point, as
well as their uncertainties.
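
The relationship between polynomial coefficients and Maclaurin coefficients can be checked numerically. The sketch below is illustrative only; the function name is hypothetical, and the coefficients are assumed to be stored in descending powers, as NumPy's `polyval` and `polyder` expect.

```python
import math

import numpy as np


def maclaurin_coefficients(alpha):
    """Return [f(0)/0!, f'(0)/1!, ..., f^(n)(0)/n!] for a polynomial whose
    coefficients alpha are given in descending powers [a_n, ..., a_1, a_0]."""
    deriv = np.asarray(alpha, dtype=float)
    degree = len(deriv) - 1
    coeffs = []
    for k in range(degree + 1):
        # k-th derivative evaluated at the origin, divided by k!
        coeffs.append(np.polyval(deriv, 0.0) / math.factorial(k))
        if k < degree:
            deriv = np.polyder(deriv)
    return coeffs


# f(x) = 3 x^2 - 2 x + 5: the Maclaurin coefficients are a_0, a_1, a_2.
t_f = maclaurin_coefficients([3.0, -2.0, 5.0])
print(t_f)
```

For this example the Maclaurin coefficients come out as 5, -2 and 3, that is, the polynomial coefficients read in ascending order, which is the statement that each $\alpha_k$ equals the $k$-th Maclaurin coefficient.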


3 Constrained and Coupled Polynomial Approximation 


The goal here is to obtain $\Delta t_{fg}^{(n)} \triangleq t_f^{(n)} - t_g^{(n)}$ via polynomial approximation. To
this end two polynomial approximations are required; whereby, the interstitial
point is used as the origin in the common coordinate system, see Fig. 1. The
approximations are coupled [6] at the interstitial point by constraining the coefficients
such that $\alpha_i = \beta_i$ for every $i \in [0 \ldots n-1]$. This ensures that the two
polynomials are $C^{n-1}$ continuous at the interstitial points. This also reduces the
degrees of freedom during the approximation and with this the variance of the
solution is reduced. For more details on constrained polynomial approximation
see [4,7].


Detection of Derivative Discontinuities in Observational Data 371 


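To remain fully general, a local polynomial approximation of degree $d_L$ is
performed to the left of the interstitial point with the support length $l_L$, creating
$f(x, \alpha)$; similarly to the right with $d_R$, $l_R$ and $g(x, \beta)$. The $x$ coordinates to the
left, denoted as $x_L$, are used to form the left Vandermonde matrix $V_L$; similarly,
$x_R$ form $V_R$ to the right. This leads to the following formulation of the
approximation process,


$$ y_L = V_L\,\alpha \quad \text{and} \quad y_R = V_R\,\beta. \qquad (10) $$

$$ \begin{bmatrix} V_L & 0 \\ 0 & V_R \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} y_L \\ y_R \end{bmatrix} \qquad (11) $$


A $C^{n-1}$ continuity implies $\alpha_i = \beta_i$ for every $i \in [0 \ldots n-1]$, which can be written
in matrix form as


$$ \begin{bmatrix} 0 & I_{n-1} & 0 & -I_{n-1} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = 0 \qquad (12) $$


Defining


$$ V \triangleq \begin{bmatrix} V_L & 0 \\ 0 & V_R \end{bmatrix}, \quad \gamma \triangleq \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad y \triangleq \begin{bmatrix} y_L \\ y_R \end{bmatrix} \quad \text{and} \quad C \triangleq \begin{bmatrix} 0 & I_{n-1} & 0 & -I_{n-1} \end{bmatrix}, $$


we obtain the task of least squares minimization with homogeneous linear constraints,


$$ \min_{\gamma} \| y - V \gamma \|_2^2 \quad \text{given} \quad C \gamma = 0. \qquad (13) $$


Clearly $\gamma$ must lie in the null-space of $C$; now, given $N$, an orthonormal vector
basis set for $\mathrm{null}\{C\}$, we obtain,


$$ \gamma = N \delta. \qquad (14) $$


Back-substituting into Eq. 13 yields,


$$ \min_{\delta} \| y - V N \delta \|_2^2. \qquad (15) $$


The least squares solution to this problem is,


$$ \delta = (V N)^{+}\, y, \qquad (16) $$


and consequently,


$$ \gamma = \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = N\,(V N)^{+}\, y. \qquad (17) $$


Formulating the approximation in the above manner ensures that the difference
in the Taylor coefficients can be simply computed as


$$ \Delta t_{fg}^{(n)} = t_f^{(n)} - t_g^{(n)} = \alpha_n - \beta_n. \qquad (18) $$


Now defining $d = [1,\ 0_{d_L - 1},\ -1,\ 0_{d_R - 1}]^{T}$, $\Delta t_{fg}^{(n)}$ is obtained from $\gamma$ as


$$ \Delta t_{fg}^{(n)} = d^{T} \gamma = d^{T} N\,(V N)^{+}\, y. \qquad (19) $$


3.1 Covariance Propagation


Defining $K = N\,(V N)^{+}$ yields $\gamma = K y$. Then, given the covariance of $y$,
i.e., $\Lambda_y$, one gets that,


$$ \Lambda_\gamma = K \Lambda_y K^{T}. \qquad (20) $$


Additionally, from Eq. 19 one could derive the covariance of the difference in the
Taylor coefficients


$$ \Lambda_\Delta = d^{T} \Lambda_\gamma\, d. \qquad (22) $$


Keep in mind that, if one uses approximating polynomials of degree $n$ to determine
a discontinuity in the $n$-th derivative, as done so far, $\Lambda_\Delta$ is just a scalar
and corresponds to the variance of $\Delta t_{fg}^{(n)}$.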
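
The constrained, coupled approximation and its covariance propagation can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: both polynomials are taken to have degree n, the interstitial point is the origin, the observation noise is assumed to be i.i.d. with standard deviation `sigma`, and the function name `coupled_fit` is hypothetical. In this sketch the identity blocks of the constraint matrix have n rows (one per constrained coefficient), and the selection vector has n zeros after each non-zero entry.

```python
import numpy as np


def coupled_fit(x_left, y_left, x_right, y_right, n, sigma=1.0):
    """Fit two degree-n polynomials that share their coefficients of order
    0..n-1 at x = 0; return (gamma, delta_t, var_delta)."""
    # Vandermonde matrices with descending powers: columns x^n, ..., x, 1.
    V_L = np.vander(x_left, n + 1)
    V_R = np.vander(x_right, n + 1)
    V = np.block([[V_L, np.zeros((len(x_left), n + 1))],
                  [np.zeros((len(x_right), n + 1)), V_R]])
    y = np.concatenate([y_left, y_right])
    # Constraint C @ gamma = 0 enforces alpha_i = beta_i for i = 0..n-1.
    zero_col = np.zeros((n, 1))
    C = np.hstack([zero_col, np.eye(n), zero_col, -np.eye(n)])
    # Orthonormal basis N of null(C), taken from the SVD of C.
    _, s, Vt = np.linalg.svd(C)
    rank = int(np.sum(s > 1e-12))
    N = Vt[rank:].T
    # gamma = N (V N)^+ y, with K = N (V N)^+.
    K = N @ np.linalg.pinv(V @ N)
    gamma = K @ y
    # d selects alpha_n - beta_n from gamma = [alpha; beta].
    d = np.zeros(2 * (n + 1))
    d[0], d[n + 1] = 1.0, -1.0
    delta_t = float(d @ gamma)
    # Covariance propagation, assuming Lambda_y = sigma^2 * I.
    cov_gamma = K @ (sigma ** 2 * np.eye(len(y))) @ K.T
    var_delta = float(d @ cov_gamma @ d)
    return gamma, delta_t, var_delta


# Noise-free example with a jump in the first derivative at x = 0 (n = 1).
x_l = np.linspace(-1.0, -0.1, 10)
x_r = np.linspace(0.1, 1.0, 10)
gamma, delta_t, var_delta = coupled_fit(x_l, 1 + 0.5 * x_l,
                                        x_r, 1 + 2.0 * x_r, n=1, sigma=0.05)
print(gamma, delta_t, var_delta)
```

In this example the left and right slopes are 0.5 and 2, so the difference in the first Taylor coefficients is -1.5, while the shared constant term stays at 1 on both sides.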


4 Error Analysis 


In this paper we consider three measures for error: 


1. the norm of the approximation residual; 
2. the combined approximation and extrapolation error; 
3. the extrapolation error. 


4.1 Approximation Error 
The residual vector has the form


$$ r = y - V \gamma = \begin{bmatrix} y_L - V_L \alpha \\ y_R - V_R \beta \end{bmatrix}. $$


The approximation error is calculated as


$$ \begin{aligned} E_a = \| r \|_2^2 &= \| y_L - V_L \alpha \|_2^2 + \| y_R - V_R \beta \|_2^2 \\ &= (y_L - V_L \alpha)^{T} (y_L - V_L \alpha) + (y_R - V_R \beta)^{T} (y_R - V_R \beta) \\ &= y^{T} y - 2 \alpha^{T} V_L^{T} y_L + \alpha^{T} V_L^{T} V_L \alpha - 2 \beta^{T} V_R^{T} y_R + \beta^{T} V_R^{T} V_R \beta. \end{aligned} $$


Fig. 2. Schematic of the approximations around the interstitial point. Red: left polynomial
approximation $f(x, \alpha)$; dotted red: extrapolation of $f(x, \alpha)$ to the RHS; blue: right
polynomial approximation $g(x, \beta)$; dotted blue: extrapolation of $g(x, \beta)$ to the LHS; $\varepsilon_i$
is the vertical distance between the extrapolated value and the observation. The approximation
is constrained with the conditions: $f(0, \alpha) = g(0, \beta)$ and $f'(0, \alpha) = g'(0, \beta)$.
(Color figure online)


4.2 Combined Error 


The basic concept, which can be seen in Fig. 2, is as follows: the left polynomial
$f(x, \alpha)$, which approximates over the values $x_L$, is extended to the right
and evaluated at the points $x_R$. Analogously, the right polynomial $g(x, \beta)$ is
evaluated at the points $x_L$. If there is no $C^n$ discontinuity in the system, the
polynomials $f$ and $g$ must be equal and consequently the extrapolated values
won't differ significantly from the approximated values.


Analytical Combined Error. The extrapolation error in a continuous case,
i.e. between the two polynomial models, can be computed with the following
2-norm,


$$ \varepsilon_x = \int_{x_{\min}}^{x_{\max}} \left\{ f(x, \alpha) - g(x, \beta) \right\}^2 dx. \qquad (23) $$


Given the constraints which ensure that $\alpha_i = \beta_i$, $i \in [0, \ldots, n-1]$, we obtain,


$$ \varepsilon_x = \int_{x_{\min}}^{x_{\max}} \left\{ (\alpha_n - \beta_n)\, x^n \right\}^2 dx. \qquad (24) $$


Expanding and performing the integral yields,


$$ \varepsilon_x = (\alpha_n - \beta_n)^2 \left\{ \frac{x_{\max}^{2n+1} - x_{\min}^{2n+1}}{2n + 1} \right\}. \qquad (25) $$


Fixed values for $x_{\min}$ and $x_{\max}$ across a single computation imply that
the factor,


$$ \frac{x_{\max}^{2n+1} - x_{\min}^{2n+1}}{2n + 1} \qquad (26) $$


is a constant. Consequently, the extrapolation error is directly proportional to
the square of the difference in the Taylor coefficients,


$$ \varepsilon_x \propto (\alpha_n - \beta_n)^2 \propto \left\{ t_f^{(n)} - t_g^{(n)} \right\}^2. \qquad (27) $$


Numerical Combined Error. In the discrete case, one can write the errors
of $f(x, \alpha)$ and $g(x, \beta)$ as


$$ e_f = y - f(x, \alpha) \quad \text{and} \quad e_g = y - g(x, \beta) \qquad (28) $$


respectively. Consequently, one could define an error function as


$$ E_{fg} = \| e_f - e_g \|_2^2 = \| (\alpha_n - \beta_n)\, z \|_2^2 = (\alpha_n - \beta_n)^2\, z^{T} z = (\alpha_n - \beta_n)^2 \sum_i x_i^{2n} \qquad (29) $$


where $z \triangleq x^{\circ n}$ is the element-wise $n$-th power of $x$. From these calculations it is clear that in the discrete case the
error is also directly proportional to the square of the difference in the Taylor
coefficients and that $E_{fg} \propto \varepsilon_x$. This proves that the numerical computation is
consistent with the analytical continuous error.
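
The proportionality in the discrete case can be verified numerically. The sketch below assumes two polynomials whose coefficients agree for all orders below n (as enforced by the constraints) and compares the directly computed combined error with the closed form; the function name and the example coefficients are hypothetical.

```python
import numpy as np


def combined_error(x, alpha, beta):
    """Squared 2-norm of e_f - e_g; the observations y cancel, leaving g - f."""
    f = np.polyval(alpha, x)
    g = np.polyval(beta, x)
    return float(np.sum((f - g) ** 2))


n = 2
x = np.linspace(-1.0, 1.0, 11)
alpha = [3.0, 1.0, 2.0]  # descending powers: alpha_2, alpha_1, alpha_0
beta = [1.0, 1.0, 2.0]   # same lower coefficients, different leading one

direct = combined_error(x, alpha, beta)
closed_form = (alpha[0] - beta[0]) ** 2 * float(np.sum(x ** (2 * n)))
print(direct, closed_form)
```

Both values agree up to floating-point rounding, since the difference of the two polynomials reduces to the leading term once the lower coefficients coincide.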


4.3 Extrapolation Error 


One could also define a different kind of error, based just on the extrapolative 
properties of the polynomials. Namely, using the notation from the beginning of 
Sect. 3, one defines 


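$$ r_{ef} = y_L - g(x_L, \beta) = y_L - V_L \beta \quad \text{and} \quad r_{eg} = y_R - f(x_R, \alpha) = y_R - V_R \alpha $$

and then calculates the error as


$$ \begin{aligned} E_e &= r_{ef}^{T} r_{ef} + r_{eg}^{T} r_{eg} \\ &= (y_L - V_L \beta)^{T} (y_L - V_L \beta) + (y_R - V_R \alpha)^{T} (y_R - V_R \alpha) \\ &= y^{T} y - 2 \beta^{T} V_L^{T} y_L + \beta^{T} V_L^{T} V_L \beta - 2 \alpha^{T} V_R^{T} y_R + \alpha^{T} V_R^{T} V_R \alpha. \end{aligned} $$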


In the example in Sect. 5, it will be seen that there is no significant numerical 
difference between these two errors. 


5 Numerical Testing 


The numerical testing is performed with: synthetic data from a piecewise polynomial,
where the locations of the $C^n$ discontinuities are known; and with real
sensor data emanating from the monitoring of heavy machinery.


5.1 Synthetic Data 


In the literature on splines, functions of the type $y(x) = e^{-x^2}$ are commonly
used. However, this function is analytic and $C^\infty$ continuous; consequently it was
not considered a suitable function for testing. In Fig. 3 a piecewise polynomial
with a similar shape is shown; however, this curve has $C^2$ discontinuities at
known locations. The algorithm was applied to the synthetic data from the
piecewise polynomial, with added noise with $\sigma = 0.05$, and the results for a single
case can be seen in Fig. 3. Additionally, a Monte Carlo simulation with $m = 10000$
iterations was performed and the results of the algorithm were compared
to the true locations of the two known knots. The mean errors in the location
of the knots are $\mu_1 = (5.59 \pm 2.05) \times 10^{-4}$ and $\mu_2 = (-4.62 \pm 1.94) \times 10^{-4}$,
with 95% confidence. Errors in the scale of $10^{-4}$, in a support with a range
$[0, 1]$, and 5% noise amplitude in the curve can be considered a highly satisfactory
result.


Fig. 3. A piecewise polynomial of degree $d = 2$, created from the knots sequence
$x_k = [0, 0.3, 0.7, 1]$ with the corresponding values $y = [0, 0.3, 0.7, 1]$. The end points
are clamped with $y'(x)|_{0,1} = 0$. Gaussian noise is added with $\sigma = 0.05$. Top: the circles
mark the known points of $C^2$ discontinuity; the blue and red lines indicate the detected
discontinuities; additionally the data has been approximated by the B-spline (red) using
the detected discontinuities as knots. Bottom: shows $\Delta t_{fg}^{(n)} = t_f^{(n)} - t_g^{(n)}$, together with
the two identified peaks. (Color figure online)


5.2 Sensor Data 


The algorithm was also applied to a set of real-world sensor data¹ emanating
from the monitoring of heavy machinery. The original data set can be seen in 
Fig. 4 (top). It has many local peaks and periods of little or no change, so the 
algorithm was used to detect discontinuities in the first derivative, in order to 
determine the peaks and phases. The peaks in the Taylor differences were used in 
combination with the peaks of the extrapolation error to determine the points of 
discontinuity. A peak in the Taylor differences means that the Taylor coefficients 
are significantly different at that interstitial point, compared to other interstitial 
points in the neighbourhood. However, if there is no peak in the extrapolation 
errors at the same location, then the peak found by the Taylor differences is 
deemed insignificant, since one polynomial could model both the left and right 
values and as such the peak isn’t a discontinuity. Additionally, it can be seen in 


1 For confidentiality reasons the data has been anonymized.


Fig. 4. The top-most graph shows a function $y(x)$, together with the detected $C^1$
discontinuity points. The middle graph shows the difference in the Taylor polynomials
$\Delta t_{fg}^{(n)}$ calculated at every interstitial point. The red and blue circles mark the relevant
local maxima and minima of the difference respectively. According to this, the red and
blue lines are drawn in the top-most graph. The bottom graph shows the approximation
error evaluated at every interstitial point. (Color figure online)


Fig. 5. The two error functions, $E_e$ and $E_{fg}$ as defined in Sect. 4, for the example from
Fig. 4. One can see that the location of the peaks doesn't change, and the two errors
don't differ significantly.


Fig. 5 that both the extrapolation error and the combined error, as defined in 
Sect. 4, have peaks at the same locations, and as such the results they provide 
do not differ significantly. 
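
The peak-combination logic described above can be sketched in a few lines. This is a simplified illustration only: it fits independent straight lines to the left and right of each interstitial point instead of the coupled constrained polynomial approximation used in the paper, and the function names, the window size `half_width` and the relative threshold `rel_threshold` are hypothetical choices, not values taken from the text.

```python
def fit_line(ts, ys):
    """Least-squares line y = c0 + c1 * t; returns (c0, c1)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (v - mean_y) for t, v in zip(ts, ys))
    c1 = sxy / sxx
    return mean_y - c1 * mean_t, c1


def discontinuity_measures(x, y, half_width=10):
    """For every interstitial point (midpoint of x[i] and x[i+1]) return
    the squared difference of the first-order coefficients of the left and
    right fits, and the error made when each fit extrapolates to the
    other side."""
    n = len(x)
    taylor_diff = [0.0] * (n - 1)
    extrap_err = [0.0] * (n - 1)
    for i in range(half_width - 1, n - half_width):
        xc = 0.5 * (x[i] + x[i + 1])  # interstitial point
        tl = [v - xc for v in x[i - half_width + 1:i + 1]]
        yl = y[i - half_width + 1:i + 1]
        tr = [v - xc for v in x[i + 1:i + 1 + half_width]]
        yr = y[i + 1:i + 1 + half_width]
        l0, l1 = fit_line(tl, yl)
        r0, r1 = fit_line(tr, yr)
        taylor_diff[i] = (l1 - r1) ** 2
        extrap_err[i] = (
            sum((l0 + l1 * t - v) ** 2 for t, v in zip(tr, yr))
            + sum((r0 + r1 * t - v) ** 2 for t, v in zip(tl, yl))
        )
    return taylor_diff, extrap_err


def joint_peaks(taylor_diff, extrap_err, rel_threshold=0.5):
    """Keep a location only if BOTH measures are large there; a large
    Taylor difference without a large extrapolation error is discarded."""
    t_max, e_max = max(taylor_diff), max(extrap_err)
    if t_max <= 0 or e_max <= 0:
        return []
    return [
        i for i, (t, e) in enumerate(zip(taylor_diff, extrap_err))
        if t >= rel_threshold * t_max and e >= rel_threshold * e_max
    ]
```

On a noise-free signal such as y = |x - 0.5|, both measures are largest at the two interstitial points adjacent to the kink. Real sensor data would need a proper peak detector in place of the simple relative threshold.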


6 Conclusion and Future Work 


It may be concluded from the results achieved that the coupled constrained 
polynomial approximation yields a good method for the detection of Cⁿ 
discontinuities in discrete observational data of continuous systems. Local peaks in the 
square of the difference of the Taylor polynomials provide a relative measure for 
determining the locations of discontinuities. 

Current investigations indicate that the method can be implemented directly 
as a convolutional operator, which will yield a computationally efficient solution. 
The use of discrete orthogonal polynomials [5,10] is being tested as a means of 
improving the sensitivity of the results to numerical perturbations. 
Abstract. Feature engineering is commonly used in classification problems 
as a means of increasing the performance of classification algorithms. 
Many methods already exist for constructing features based on the 
combination of attributes but, to the best of our knowledge, none of 
them takes into account a particular characteristic found in many 
problems: causality. In many observational data sets, causal relationships 
can be found between the variables, meaning that it is possible to extract 
those relations from the data and use them to create new features. The 
main goal of this paper is to propose a framework for the creation of new 
supposed causal probabilistic features that encode the inferred causal 
relationships between the target and the other variables. In the 
experiments reported here, the new features improved the performance of 
the Random Forest algorithm. 


Keywords: Causality · Causal discovery · Conditional probability · 
Feature engineering · Causal features 


1 Introduction 


In regular classification problems, a set of data, classified with a finite set of 
classes, is used as input so that the chosen classification algorithm can build a 
model, that represents the behaviour of the learning set. This classifier can have 
better or worse results, depending on the data and how the algorithm handles 
it. 

Nevertheless, in many problems, applying only machine learning algorithms 
may not be the answer [4]. Instead, the use of feature engineering can be a way 
of improving the performance of these algorithms. 

Feature engineering is a process by which new information is extracted from 
the available data to create new features. These new features are related to 
the original variables, but also to the target variable. They are a better 
representation of the knowledge embedded in the data and hence help the algorithms 
achieve more accurate results [4]. These types of solutions are usually 
problem-related: one solution might work in one particular problem, but not 


© The Author(s) 2020 
in the other. However, there is one particular characteristic common to many 
classification problems: causality. In observational data, causal relationships 
may exist between variables, especially in data related to 
medical problems (among others) [16,17]. This fact should be taken into 
consideration, for example when selecting or creating new features, since it can give 
clues as to which variables are the most important to the problem. 

By definition, causality, and more specifically causal discovery, relates to the 
search for possible cause-effect relationships between variables [13]. The 
application of causal discovery in the various tasks of machine learning may be 
challenging, at the level of both the causal process and the sampling process that generate 
the observed data [9]. Despite this, the subject has been the focus of several 
researchers over the years, given the importance and the potential impact that 
the discovery of causal relationships between events can have on problem 
solving. In the words of Judea Pearl: “while probabilities encode our beliefs about 
a static world, causality tells us whether and how probabilities change when the 
world changes, be it by intervention or by an act of imagination” [20]. By 
discovering causal relationships, it is possible to uncover not only correlations but 
also relations that explain how and why the variables behave the way they do. 

In this paper, we propose a framework to create new features for discrete 
data sets (discrete features + discrete target) based on the causal relationships 
uncovered in the data. These attributes are created through the generation of a 
causal network, using a modified version of PC [21], and a posterior probabilistic 
analysis of the relations between a target variable and the variables considered relevant. 
The relevant variables can be chosen by two different methods: the parents and 
children of the target, or its Markov blanket [19]. 

This paper is organised as follows: Sect. 2 describes some important defini- 
tions. Section 3 describes the proposed framework and Sect. 4 the results obtained 
in the tests. 


2 Background 


In this section, we introduce some important notations that are used throughout 
the document. 


2.1 PC 


PC is a constraint-based algorithm and was proposed by Spirtes et al. [21]. 
This algorithm relies on the faithfulness assumption (“If we have a joint prob- 
ability distribution P of the random variables in some set V and a DAG G 
=(V,E), (G,P) satisfies the faithfulness condition if, G entails all and only con- 
ditional independencies in P” [18]), meaning that all the independencies in a 
DAG (directed acyclic graph) need to respect the d-separation criterion [8]. 
This algorithm is divided into two phases. In the first phase, the algorithm 
starts with a fully connected undirected graph. It removes an edge if the two 
nodes are independent, i.e., if there is a set of nodes adjacent to both variables 
given which they are conditionally independent [12]. One of the most widely applied statistical 
independence tests is G², proposed by Spirtes et al. [21], and then used in 
non-causal Bayesian networks by Tsamardinos et al. [24]. 

In the second phase [12], the algorithm orients the edges by first searching 
for v-structures (A → B ← C) and then by applying a set of rules, to create a 
completed partially directed acyclic graph (CPDAG) that is equivalent to the 
original one and in which faithfulness is respected. 
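
The two phases can be illustrated with a compact sketch. It covers only the skeleton search and the detection of v-structures; the further orientation rules are omitted. The conditional independence test is passed in as a callable `is_independent(a, b, cond)`, which is a hypothetical interface chosen for this illustration (in practice it would wrap a statistical test such as G²). The function names are likewise not taken from any particular library.

```python
from itertools import combinations


def pc_skeleton(nodes, is_independent):
    """Phase 1: start fully connected, then remove the edge a - b whenever
    a and b are independent given some subset of a's other neighbours.
    Returns the adjacency sets and the separating set of each removed edge."""
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    size = 0
    # Grow the conditioning-set size until no node has enough neighbours.
    while any(len(adj[a] - {b}) >= size for a in nodes for b in adj[a]):
        for a in nodes:
            for b in list(adj[a]):
                for cond in combinations(sorted(adj[a] - {b}), size):
                    if is_independent(a, b, set(cond)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = set(cond)
                        break
        size += 1
    return adj, sepset


def find_v_structures(adj, sepset):
    """Phase 2 (first part): for every unshielded triple a - b - c, orient
    a -> b <- c if b is not in the set that separated a and c."""
    found = []
    for b in sorted(adj):
        for a, c in combinations(sorted(adj[b]), 2):
            if c not in adj[a] and b not in sepset.get(frozenset((a, c)), set()):
                found.append((a, b, c))
    return found
```

With an oracle that encodes a single collider A → C ← B (A and B independent only when C is not conditioned on), the sketch keeps the edges A - C and B - C and reports the triple (A, C, B) as a v-structure.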


2.2 Cochran-Mantel-Haenszel Test 


The Cochran-Mantel-Haenszel test [2] is an independence test that studies 
the influence of two variables on each other while taking into account the 
possible influence of other variables on this dependence, i.e., it searches for causal 
dependence [11]. 

There are two different versions of this test: the normal Cochran-Mantel-Haenszel 
test, which is used in 2 × 2 × K tables (where K is the number of tables 
created), and the Generalised Cochran-Mantel-Haenszel test, which is used in 
I × J × K tables (where I and J represent the number of categories in the 
studied variables, and K the number of layer categories [6]). 

It is important to note that these types of contingency tables (three-way 
tables) represent the association between two variables when the 
influence of the other covariates is controlled. 

Since many causal discovery algorithms (for discrete data) are used in data 
sets that are composed of a mixture of binary and non-binary discrete variables, 
the normal Cochran-Mantel-Haenszel test for 2 × 2 × K contingency tables is not 
enough. In such cases, the generalised version of this test can be applied instead 
(Generalised Cochran-Mantel-Haenszel test, Eq. (1) [15]). 


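\[
Q_{CMH} = G^{\top}\,\mathrm{Var}\{G \mid H_0\}^{-1}\,G, \qquad G_h = B_h (n_h - m_h),
\]
\[
G = \sum_h G_h, \qquad \mathrm{Var}\{G \mid H_0\} = \sum_h \mathrm{Var}\{G_h \mid H_0\}, \qquad B_h = C_h \otimes R_h \tag{1}
\]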


In the equations presented previously, B_h represents the Kronecker product 
of C_h and R_h, Var the covariance matrix, (n_h − m_h) the difference 
between the observed and the expected, C_h and R_h the column scores and 
row scores respectively, and H_0 the null hypothesis¹. 
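
To make the idea of a stratified test concrete, the following sketch computes the statistic for the simpler 2 × 2 × K case only, not the generalised form of Eq. (1). It uses the standard textbook formula without continuity correction; the function names are chosen for this illustration.

```python
from math import erfc, sqrt


def cmh_statistic(strata):
    """Cochran-Mantel-Haenszel chi-square statistic (1 degree of freedom)
    for a list of 2 x 2 tables [[a, b], [c, d]], one table per stratum.
    No continuity correction is applied."""
    observed = expected = variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue  # such a stratum carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("all strata are degenerate")
    return (observed - expected) ** 2 / variance


def cmh_p_value(strata):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom, which equals erfc(sqrt(x / 2))."""
    return erfc(sqrt(cmh_statistic(strata) / 2))
```

Each stratum contributes the difference between the observed top-left count and its expectation under independence, so an association that is only an artefact of the stratifying variable cancels out within the strata.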


3 Framework 


In many machine learning problems, the application of only classification algo- 
rithms might not be the answer to obtain satisfactory results [4]. The application 


1 “For each of the separate levels of the co-variable set h = 1, 2, ..., q, the response 
variable is distributed at random with respect to the sub-populations, i.e. the data 
in the respective rows of the hth table can be regarded as a successive set of simple 
random samples of sizes {N_hi.} from a fixed population corresponding to the marginal 
total distribution of the response variable {N_h.j}.” [15]. 
of feature engineering to the target data can be a way of improving such results. 
There are already several methods to improve the overall performance of an 
algorithm through the creation or modification of attributes, but, to the best of 
our knowledge, none of them explores the potential causal relationships between 
the target variable and the other variables. 

The addition of these new inferred causal attributes may help improve 
the performance of classification algorithms, since they encode the relationship 
between the target and the other variables, thus feeding more information about 
the data set and its behaviour to the model. Moreover, these features may also 
aid the interpretability of the generated models: because they encode the underlying 
relationships between the variables, the decisions made by the models become easier 
to explain. 

In this section, we present a new framework to create new features using 
causal probabilities retrieved from a model that represents the causal associations 
between variables. This framework can be divided into four different phases: 


1. Creation of the causal model (in this approach we suggest the usage of a 
modified version of PC); 
2. Identification of the relevant variables. These variables are directly related to 
the target variable, in one of two ways: 
- They are its parents and children; 
- They belong to its Markov blanket (i.e. parents, children and spouses). 
3. Inference of the probabilities associated with each pair {target variable, 
associated variable}; 
4. Creation of the new features using these probabilities. The number of features 
should be: number of associated variables × number of classes. 


In the first step, the framework starts by creating a full causal model that 
represents the causal associations between all the variables. This is done through 
the application of a modified version of PC [21]. In this modified version, the 
state-of-the-art independence test (usually χ² or G²) is replaced by the Generalised 
Cochran-Mantel-Haenszel test presented in Sect. 2.2. This test has the advantage 
(over χ² and G²) of adjusting for confounding factors [22]. 

It is important to note that, in some cases, PC can’t direct every edge, hence 
it creates a CPDAG. In those cases, we apply a method to direct such edges. 
This method, proposed by Dor and Tarsi [5] searches recursively for possible 
ways to direct undirected edges. 

In the second step, the framework selects the relevant variables. To select 
these attributes, we propose two different approaches: parents and children and 
Markov blanket. 

In the parents and children (P-C) approach, as the name says, the variables 
selected are the ones that, in the causal graph, have an edge directed to the 
target (parents) or from the target (children). 

In the Markov blanket (MB) approach, both the parents and children of the 
target are selected, as well as the nodes that have edges directed to the child 
nodes (also called spouse nodes). It is important to note that the most common 
way to select the variables that influence the target is through Markov Blanket 
Table 1. Example of probabilities generated by the probability queries 


Attr 


(often used in causal feature selection methods [10]). However, several authors 
proposed to use only parents and children, as these variables can be considered 
to be the ones with the most influence in the target within its Markov blanket 
i323]. 
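
The two selection strategies can be sketched as follows, assuming the causal graph is available as a list of directed edges (parent, child). The edge list in the usage note is hypothetical: it is merely one graph that is consistent with the sets reported in the illustrative example of Sect. 3.1, not the network of Fig. 2.

```python
def parents_and_children(edges, target):
    """Variables with an edge into the target (parents) or out of it (children)."""
    parents = {u for u, v in edges if v == target}
    children = {v for u, v in edges if u == target}
    return parents | children


def markov_blanket(edges, target):
    """Parents, children and spouses (other parents of the target's children)."""
    children = {v for u, v in edges if u == target}
    spouses = {u for u, v in edges if v in children and u != target}
    return parents_and_children(edges, target) | spouses
```

For instance, with the edges A → B, B → E, F → E and C → D and target B, the first function returns {A, E} and the second returns {A, E, F}.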

In the third step, the framework infers a set of probabilities that represent 
the influence of each relevant variable on the classes of the target: the posterior 
probability distribution (Eq. (2)). In these probabilistic queries, the objective 
is to find the influence that a piece of evidence (a particular value of the relevant 
variable) has on the value of the target [14]. This is performed for all the values 
of each variable, and the resulting probability matrix is similar to Table 1. 


\[
P(\mathit{Target} = t \mid \mathit{Attr} = a) = \frac{\mathrm{occurrences}_{t \cap a}}{\mathrm{occurrences}_{a}} \tag{2}
\]

Finally, in the fourth step, the new features are created and added to the 
data set. Each new feature represents the probability of a relevant variable's 
influence on a specific class. For example, if we have a target variable with 
two classes ({0,1}) and a relevant variable Attr, two new features are created, 
representing the influence of Attr on each class. In each instance, the value of a 
feature is the influence that the value of Attr in that instance has on the class 
represented by that feature. 
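
Steps three and four can be sketched by simple counting, as in Eq. (2). The representation of a data set as a list of dictionaries and the naming scheme of the new columns are assumptions made for this illustration.

```python
from collections import Counter


def conditional_probabilities(rows, attr, target):
    """Estimate P(target = t | attr = a) by counting, as in Eq. (2).
    Returns {a: {t: probability}}."""
    joint = Counter((row[attr], row[target]) for row in rows)
    marginal = Counter(row[attr] for row in rows)
    classes = sorted({row[target] for row in rows})
    return {
        a: {t: joint[(a, t)] / n_a for t in classes}
        for a, n_a in marginal.items()
    }


def add_probability_features(rows, attr, target):
    """Return copies of the rows with one new feature per target class,
    holding the probability of that class given the row's value of attr."""
    table = conditional_probabilities(rows, attr, target)
    extended = []
    for row in rows:
        new_row = dict(row)
        for t, p in table[row[attr]].items():
            new_row[f"P({target}={t}|{attr})"] = p
        extended.append(new_row)
    return extended
```

Every row with the same value of the relevant variable receives the same set of probabilities, so one relevant variable contributes exactly as many new columns as the target has classes.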

An overview of the framework can be seen in Fig. 1. 


3.1 An Illustrative Example 


To explain in more detail how this approach works, we will use as example a 
data set with 6 discrete variables (A, B, C, D, E and F), with 5000 instances. 
The values for variables A, B, C, D, and E can be {0,1,2}, while F can have the 
values {0,1}. For this example, we will use variable B as the target. 

As explained in Step 1, the approach starts by generating the full 
network with PC and the Generalised Cochran-Mantel-Haenszel test. The generated 
network can be seen in Fig. 2. 

After the creation of the full network, the relevant variables are selected. The 
selected variables can be the parents and children (P-C) of B ({A, E}) or the Markov 
blanket (MB) of B ({A, E, F}). 
Fig. 1. Example of the operation of the proposed framework 


In the third step, the framework generates inference probabilities for the 
chosen variables (Table 2). Taking A = 0 as an example, the probability of each 
target value is obtained by counting: for B = 0, the number of times both A = 0 
and B = 0 occur is divided by the number of times A = 0 occurs, giving 
P(B = 0 | A = 0) = 0.86. 

These probabilities are then added to the global data set. The resulting data 
set is similar to Table 3. The number of new features differs between the two 
approaches, since the number of generated features is equal to the product of 
the number of values of the target and the number of relevant variables. Since 
the MB approach selects more variables than the P-C approach, in theory, the 
number of generated features will be higher. So, in the case of P-C we 
have 6 new features, and in the case of MB we generate 9 new features. 
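
As a worked check of these counts, take the three values {0, 1, 2} of the target B and the relevant sets {A, E} (P-C) and {A, E, F} (MB) given above:

\[
\underbrace{3}_{\text{classes of } B} \times \underbrace{2}_{|\{A, E\}|} = 6,
\qquad
\underbrace{3}_{\text{classes of } B} \times \underbrace{3}_{|\{A, E, F\}|} = 9
\]

The count depends on the number of classes of the target and not on the number of values of the relevant variable, which is why the binary variable F still contributes three features.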


4 Results and Discussion 


To evaluate the proposed approaches and make a comparative study, the 
following experiments were designed: the performance of Random 
Forest using the original data was compared with its performance using the 
versions generated by the two proposed approaches. 
Fig. 2. Example: network generated 


Table 2. Probabilities generated for the Markov blanket variables. In the parents and 
children case, the probabilities for F are not generated. 

          A                    E                    F
Target    0     1     2        0     1     2        0     1
0         0.86  0.45  0.11     0.74  0.46  0.15     0.47  0.48
1         0.03  0.22  0.09     0.08  0.11  0.16     0.11  0.12
2         0.11  0.32  0.78     0.19  0.44  0.68     0.41  0.41



This comparative analysis was made through 10-fold cross validation, in sev- 
eral public data sets (Table 4). For each fold, the two approaches are applied to 
the train set and then the resulting conditional probabilities are used to create 
the new features for both the train and test set (this ensures that no information 
about the classes in the test set is added to the new features). 
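
The fold-wise procedure can be sketched as a small fit/transform pair, so that the probabilities are learned on the training part only and merely looked up for the test part. The class and function names are hypothetical. Two further assumptions are made: folds are built by deterministic interleaving without shuffling, and the training class prior is used as a fallback for attribute values that never occur in the training part (the paper does not state how such values are handled).

```python
from collections import Counter


class ProbabilityFeatureEncoder:
    """Learns P(target = t | attr = a) on training rows only."""

    def __init__(self, attr, target):
        self.attr, self.target = attr, target

    def fit(self, train_rows):
        joint = Counter((r[self.attr], r[self.target]) for r in train_rows)
        marginal = Counter(r[self.attr] for r in train_rows)
        class_counts = Counter(r[self.target] for r in train_rows)
        self.classes = sorted(class_counts)
        self.table = {
            a: [joint[(a, t)] / n for t in self.classes]
            for a, n in marginal.items()
        }
        # Fallback for unseen attribute values (an assumption of this sketch).
        self.prior = [class_counts[t] / len(train_rows) for t in self.classes]
        return self

    def transform(self, rows):
        # Only the attribute is read here, never the target of the row.
        return [self.table.get(r[self.attr], self.prior) for r in rows]


def kfold_indices(n_rows, k):
    """Deterministic interleaved folds; real experiments usually shuffle first."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def encode_per_fold(rows, attr, target, k=10):
    """For each fold, fit on the training part and encode both parts."""
    results = []
    for test_idx in kfold_indices(len(rows), k):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in test_idx]
        encoder = ProbabilityFeatureEncoder(attr, target).fit(train)
        results.append((encoder.transform(train), encoder.transform(test)))
    return results
```

Because `transform` never reads the target column, the labels of the test rows cannot leak into the new features.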

To choose the optimal parameters for the approaches presented in the fol- 
lowing sections, a sensitivity analysis was performed. This analysis consisted of 
obtaining the error (1 - accuracy) for the presented data sets (by dividing them 
into 70% train, 30% test). In the case of PC this test was repeated for significance 
levels 1% and 5%. In these tests we concluded that the error of the algorithms 
in the data sets did not change much when the parameters were changed. For 
this reason, for all the data sets we select and present a significance level of 5%. 

The performance of the algorithm was compared in terms of error rate 
(Table 5), using No new features as the reference. The performance of the 
classification algorithm trained with causal features on 
each data set was compared to the reference using the Wilcoxon signed-rank 
test. The sign +/− indicates that the algorithm is significantly better/worse than 
the reference, with a p-value of less than 0.05. Besides this, the algorithms are 
also compared in terms of the average and geometric mean of the errors, average 
ranks, average error ratio, win/losses, significant win/losses (number of times 
Table 3. Features generated with the probabilities for Markov blanket variables. In the 
parents and children case, the features related to F are not generated. 

A  B  C  D  E  F  B=2|A  B=0|A  B=1|A  B=2|E  B=0|E  B=1|E  B=2|F  B=0|F  B=1|F
1  2  1  0  1  1  0.35   0.44   0.22   0.44   0.45   0.10   0.41   0.48   0.12
1  0  2  0  1  1  0.35   0.44   0.22   0.44   0.45   0.10   0.41   0.48   0.12
0  0  0  0  0  0  0.11   0.87   0.02   0.19   0.73   0.08   0.41   0.47   0.11
0  0  0  0  1  1  0.11   0.87   0.02   0.44   0.45   0.10   0.41   0.48   0.12
0  0  1  2  0  0  0.11   0.87   0.02   0.19   0.73   0.08   0.41   0.47   0.11

Table 4. Data set description 

Data set                 Number of   Number of    Classes (proportion)
                         examples    attributes
breast cancer            286         10           0 (70%)  1 (30%)
cervical                 858         16           0 (94%)  1 (6%)
corral                   160         7            0 (56%)  1 (44%)
earthquake               10000       5            0 (2%)   1 (98%)
head injury              3121        11           0 (92%)  1 (8%)
lucas                    2000        12           0 (28%)  1 (72%)
medpar                   1495        9            0 (66%)  1 (34%)
mifem                    1275        10           0 (25%)  1 (75%)
qualitative bankruptcy   250         7            0 (43%)  1 (57%)
respiratory              555         5            0 (51%)  1 (49%)
survey                   10000       6            0 (56%)  1 (28%)  2 (16%)
titanic                  1316                     0 (62%)  1 (38%)
xd6                      973         10           0 (67%)  1 (33%)


that the reference was better or worse than the algorithm, using the signed-rank 
test) and the Wilcoxon signed-rank test. For the Wilcoxon signed-rank test 
we also consider a significance level of 0.05. 
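
The summary measures listed above can be computed directly from the mean error rates of Table 5, as the following sketch shows. The error values are copied from that table. The ranking convention (rank 1 for the lowest error, tied values sharing the lowest rank) is an assumption that was chosen because it matches the averages printed in the table. The Wilcoxon test itself is not reimplemented here.

```python
from math import exp, log

# Mean error rates (%) from Table 5:
# (No new features, +Causal features P-C, +Causal features MB)
ERRORS = {
    "breast cancer": (28.6, 28.6, 28.0),
    "cervical": (6.88, 6.65, 6.53),
    "corral": (5.62, 0.01, 0.01),
    "earthquake": (0.26, 0.20, 0.20),
    "head injury": (7.08, 7.43, 7.05),
    "lucas": (15.2, 14.5, 14.5),
    "medpar": (32.70, 33.00, 34.10),
    "mifem": (20.1, 20.00, 19.9),
    "qualitative bankruptcy": (0.40, 0.01, 0.80),
    "respiratory": (40.90, 40.20, 41.2),
    "survey": (44.60, 44.4, 44.4),
    "titanic": (21.4, 20.20, 20.5),
    "xd6": (0.41, 0.10, 0.10),
}
ROWS = list(ERRORS.values())


def average(values):
    return sum(values) / len(values)


def average_error(col):
    return average([row[col] for row in ROWS])


def geometric_mean_error(col):
    return exp(average([log(row[col]) for row in ROWS]))


def average_rank(col):
    # Rank within each data set; ties share the lowest rank (assumption).
    return average([1 + sum(other < row[col] for other in row) for row in ROWS])


def average_error_ratio(col, ref=0):
    return average([row[col] / row[ref] for row in ROWS])


def wins_losses(col, ref=0):
    wins = sum(row[col] < row[ref] for row in ROWS)
    losses = sum(row[col] > row[ref] for row in ROWS)
    return wins, losses
```

Column 0 is the reference, column 1 the P-C approach and column 2 the MB approach; for example, `wins_losses(1)` counts the data sets on which P-C has a lower or higher error than the reference.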

If we analyse Table 5, it is possible to see that, in general, +Causal features 
P-C (the addition of features representing the conditional probability of parents 
and children features on the target) performs better than No new 
features, since the value obtained in the Wilcoxon test is 0.0266 (less than the 
significance level of 0.05), which means that the difference in performance is 
significant. This difference can also be seen in the values of the average and 
geometric means. More specifically, if we look at the average ranks, we can see 
that +Causal features P-C has a lower rank (on average) than No new features 
(1.462 against 2.538). 

If we now compare the second approach proposed (+Causal features MB) 
with the reference, we can see that there is a positive difference in the results 
Table 5. Error rates of Random Forest for classification with causal features 


Data set                 No new features   +Causal features P-C   +Causal features MB
breast cancer            28.6 ± 9.88       28.6 ± 7.49            28 ± 8.39
cervical                 6.88 ± 1.51       6.65 ± 1.66            6.53 ± 1.49
corral                   5.62 ± 5.47       + 0.01 ± 0.10          + 0.01 ± 0.10
earthquake               0.26 ± 0.14       0.20 ± 0.14            0.20 ± 0.14
head injury              7.08 ± 1.23       7.43 ± 0.83            7.05 ± 0.69
lucas                    15.2 ± 2.02       14.5 ± 2.12            14.5 ± 2.12
medpar                   32.70 ± 4.29      33.00 ± 3.91           34.10 ± 3.23
mifem                    20.1 ± 4.28       20.00 ± 4.30           19.9 ± 3.63
qualitative bankruptcy   0.40 ± 1.26       0.01 ± 0.10            0.80 ± 2.53
respiratory              40.90 ± 6.79      40.20 ± 6.20           41.2 ± 6.90
survey                   44.60 ± 2.26      44.4 ± 2.05            44.4 ± 2.05
titanic                  21.4 ± 2.52       20.20 ± 2.19           20.5 ± 1.83
xd6                      0.41 ± 0.72       0.10 ± 0.10            0.10 ± 0.10
Average Mean             17.242            16.562                 16.715
Geometric Mean           7.161             2.889                  4.039
Average Ranks            2.538             1.462                  1.538
Average Error Ratio      1                 0.764                  0.914
Wilcoxon test                              0.0266                 0.1465
Win/Losses                                 10/2                   10/3
Significant win/losses                     1/0                    1/0


Table 6. AUC for Lucas data set 


AUC 
No new features 0.877 
+Causal features P-C | 0.887 
+Causal features MB | 0.889 


(although not significant). It is possible to see this difference, once again, in the 
average and geometric mean, as well as in the average rank (1.538). 

In Table 6, it is possible to see the AUC values for the three analysed 
approaches, for the lucas data set². The results presented in this table were obtained 
by dividing this data set into train and test sets (70%/30%). The model scores were 
then obtained for the test data (with a 50% cutoff). 

In this table it is possible to see that +Causal features MB has the highest 
area, meaning that, on the data set with the causal probabilistic features that 
represent the relations between the target and its Markov blanket, Random Forest 
can distinguish the classes better than with the data from the other approaches, 
thus having a better performance [7]. Although +Causal features MB was the 


2 http://www.causality.inf.ethz.ch/data/LUCAS.html. 
best approach in terms of AUC, the other proposed approach, +Causal features 
P-C, also obtained an AUC higher than the reference. 

Finally, from these results, we can conclude that there is evidence that 
applying causality to the creation of new features can have a positive impact on the 
performance of the classification algorithm. 


5 Conclusion 


The achievement of satisfactory results in a classification problem depends not only 
on the chosen classifier but also on the data being processed. One possible 
way to improve the performance of classifiers is to apply feature engineering, 
or in other words, to use the original data to infer new information, creating new 
attributes and altering others, to obtain more descriptive features. However, 
most of the proposed methodologies do not take into account the possible causal 
relationships in the data. This information can help to create more accurate 
models, since we are encoding in one variable information about the interaction 
between variables, thus reinforcing their importance. 

In this paper we proposed a framework that uses causal discovery to create 
new features based on a posterior probabilistic analysis of the relations between 
a target variable and the variables considered relevant, these variables being 
either the parents and children or the Markov blanket of the target. 

In the experiments, we compared the approaches with the original data, using 
Random Forest in public data sets. From these results, we can conclude that 
there is evidence that the application of causality in the creation of new supposed 
probabilistic features may have a positive impact on the overall performance of 
the classification algorithm. 

In the future, we intend to study the application of these techniques in other 
classifiers, as well as in the classification of mixed data (continuous and discrete 
variables). 
Abstract. Many image analysis tasks involve the automatic segmentation 
and counting of objects with specific characteristics. However, we 
find that current approaches look to either segment objects or count 
them through bounding boxes, and those methodologies that both segment 
and count struggle with co-located and overlapping objects. This 
restricts our capabilities when, for example, we require the area covered 
by particular objects as well as the number of those objects present, 
especially when this information must be obtained for a large number of images. 
In this paper, we address this by proposing a Dual-Output U-Net. 
DO-U-Net is an Encoder-Decoder style, Fully Convolutional Network 
(FCN) for object segmentation and counting in image processing. Our 
proposed architecture achieves precision and sensitivity superior to other, 
similar models by producing two target outputs: a segmentation mask 
and an edge mask. Two case studies are used to demonstrate the capa- 
bilities of DO-U-Net: locating and counting Internally Displaced People 
(IDP) tents in satellite imagery, and the segmentation and counting of 
erythrocytes in blood smears. The model was demonstrated to work with 
a relatively small training dataset, achieving a sensitivity of 98.69% for 
IDP camps of the fixed resolution, and 94.66% for a scale-invariant IDP 
model. DO-U-Net achieved a sensitivity of 99.07% on the erythrocytes 
dataset. DO-U-Net has a reduced memory footprint, allowing for training 
and deployment on a machine with a lower to mid-range GPU, making it 
accessible to a wider audience, including non-governmental organisations 
(NGOs) providing humanitarian aid, as well as health care organisations. 


Keywords: Convolutional neural networks · U-Net · Segmentation · 
Counting · Satellite imagery · Blood smear 


1 Introduction 


Over recent years, the volumes of data collected across all industries globally have 
grown dramatically. As a result, we find ourselves in an ever greater need for fully 
automated analysis techniques. The most common approaches to large scale data 
analysis rely on the use of supervised and unsupervised Machine Learning, and, 


increasingly, Deep Learning. Using only a small number of human-annotated 
data samples, we can train models to rapidly analyse vast quantities of data 
without sacrificing the quality or accuracy compared with a human analyst. In 
this paper, we focus on images - a rich datatype that often requires rapid and 
accurate analysis, despite its volumes and complexity. Object classification is one 
of the most common types of analysis undertaken. In many cases, a further step 
may be required in which the classified and segmented objects of interest need 
to be accurately counted. While easily performed by humans, albeit slowly, this 
task is often non-trivial in Computer Vision, especially in the cases where the 
objects exist in complex environments or when objects are closely co-located and 
overlapping. We look at two such case studies: locating and counting Internally 
Displaced People (IDP) shelters in Western Afghanistan using satellite imagery, 
and the segmentation and counting of erythrocytes in blood smear images. Both 
applications have a high impact in the real world and are in need of new rapid 
and accurate analysis techniques. 


1.1 Searching for Shelter: Locating IDP Tents in Satellite Imagery 


Over 40 million individuals were believed to have been internally displaced glob- 
ally in 2018 [1]. Afghanistan is home to 2,598,000 IDPs displaced by conflict 
and violence, with the numbers growing by 372,000 in the year 2018 alone. In 
the same year, an additional 435,000 individuals were displaced due to natural 
disasters, 371,000 of whom were displaced due to drought. 

The internally displaced population receive aid from various non- 
governmental organisations (NGOs), to prevent IDPs from becoming refugees. 
The Norwegian Refugee Council (NRC) is one such body, providing humanitar- 
ian aid to IDPs across 31 countries, assisting 8.5 million people in 2018 [2]. In 
Afghanistan, the NRC has been providing IDPs with tents as temporary living 
spaces. Alcis is assisting the NRC with the analysis of the number, flow, and con- 
centration of these humanitarian shelters, enabling valuable aid to be delivered 
more effectively. 


Existing Methods. In the past, Geographical Information System (GIS) Tech- 
nicians relied mostly on industry-standard software, such as ArcGIS, for the 
majority of their analysis. These tools provide the user with a small number 
of built-in Machine Learning algorithms, such as the popularly used implemen- 
tation of the Support Vector Machine (SVM) algorithm [3]. These generally 
involve a time consuming, semi-automated process, with each step being revis- 
ited multiple times as manual tuning of the model parameters is required. The 
methodology does not allow for batch processing as all stages must be repeated 
with human input for each image. An example of the ArcGIS process¹ used by 
GIS technicians is: 


1. Manually segment the image by selecting a sample of objects exhibiting sim- 
ilar shape, spectral and spatial characteristics. 


1 Details of the process have been provided by Alcis. 
2. Train the image classifier to identify other examples similar to the marked 
sample, using a built-in machine learning model (e.g. SVM). 
3. Identify any misclassified objects and repeat the above steps. 


More recently, many GIS specialists have begun to look towards the latest 
techniques in Data Science and Big Data analysis to create custom Machine 
Learning solutions. A review paper by Quinn et al. in 2018 [4] weighed up the 
merits of Machine Learning approaches used to segment and count both refugee 
and IDP camps. Their work used a sample of 87,123 structures, a magnitude 
which was required for training using their approach and was seen as a limitation. 
Quinn et al. used the popular Mask R-CNN [5] segmentation model to analyse 
their data, a model using a Region Proposal Network to simultaneously classify 
and segment images. This yielded an average precision of 75%, improving to 78% 
by applying a transfer learning approach. 


1.2 Counting in Vein: Finding Erythrocytes in Blood Smears 


The counting of erythrocytes, or red blood cells, in blood smear images, is 
another application in which one must count complex objects. This is an impor- 
tant task in exploratory and diagnostic medicine, as well as medical research. 
An average blood smear imaged using a microscope, contains several hundred 
erythrocytes of varying size, many of which are overlapping, making an accurate 
manual count both difficult and time-consuming. 


Existing Methods. While only a small number of analyses have been able to 
successfully perform an erythrocyte count, Tran et al. [6] achieved a counting 
accuracy of 93.30%. Their technique relied on locating the cells using the 
SegNet [7] network. SegNet is an encoder-decoder style FCN architecture producing 
segmentation masks as its output. Due to the overlap of erythrocyte cells, they 
performed a Euclidean Distance Transform on the binary segmentation masks 
to obtain the location of each cell using a connected component labelling 
algorithm. A work by Alam and Islam [8] proposes an approach using YOLO9000 [9], 
a network using a similar approach to Mask R-CNN, to locate elliptical bounding 
regions that roughly correspond to the outer contours of the cells. Using 
300 images for training, each containing a small number of erythrocytes, they 
achieve an accuracy of 96.09%. Bounding boxes acted as ground-truth for Alam 
and Islam, as opposed to the segmentation masks used by Tran et al. 
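
The idea of separating overlapping cells with a distance transform followed by connected component labelling can be illustrated with a small pure-Python sketch. It is not the pipeline of Tran et al.: the brute-force distance computation, the use of 4-connectivity and the thresholding of the distance map at a hypothetical `core_threshold` are all assumptions made for this illustration.

```python
from collections import deque
from math import hypot


def distance_transform(mask):
    """Brute-force Euclidean distance from every foreground pixel to the
    nearest background pixel (assumes at least one background pixel)."""
    rows, cols = len(mask), len(mask[0])
    background = [(r, c) for r in range(rows) for c in range(cols) if not mask[r][c]]
    dist = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                dist[r][c] = min(hypot(r - br, c - bc) for br, bc in background)
    return dist


def count_components(mask):
    """Number of 4-connected foreground regions, found by breadth-first search."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                        if (0 <= nr < rows and 0 <= nc < cols
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return count


def count_cells(mask, core_threshold):
    """Keep only pixels deep inside the foreground, then count the cores."""
    dist = distance_transform(mask)
    cores = [[d > core_threshold for d in row] for row in dist]
    return count_components(cores)
```

Two overlapping discs form a single connected region in the raw mask, but the narrow neck between them lies close to the background, so thresholding the distance map leaves one core per disc.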


2 Data Description 


2.1 Satellite Imagery 


Working on the ground, the NRC identified areas within Western Afghanistan 
with known locations of IDP camps. Through their relationship with Maxar 
[10], Alcis has access to satellite imagery covering multiple camps, in a range 
of different environments. Figure 1 shows a section of a camp in Qala’e’Naw, 
Badghis. 

This work uses imagery collected by the WorldView-2 and WorldView-3
satellites [11], which are owned and operated by Maxar. WorldView-2 has a
multispectral resolution of 1.85 m, while the multispectral resolution of
WorldView-3 is 1.24 m [12], allowing tents approximately 7.5 m long and 4 m
wide to be resolved. The WorldView-2 images were captured on either 05/01/19
(DD/MM/YY) or 03/03/19, with the WorldView-3 images captured on 12/03/19. A
further set of images, observed between 08/08/18 and 23/09/19 by WorldView-3,
became available for some locations. This dataset can be used to show
evolution in the camps during this period, allowing for a better understanding
of the changes undergone in IDP camps. Due to the orbital position of the
satellite, images observed at different times vary in resolution and in other
properties, owing to differences in viewing angle and atmospheric effects.
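
To give a feel for how few pixels a tent occupies at these resolutions, the short sketch below converts the quoted tent dimensions into pixel footprints. It only uses the figures stated above; the function name is illustrative, and the result ignores viewing angle and atmospheric effects.

```python
# Multispectral resolutions quoted in the text, in metres per pixel.
RESOLUTION_M = {"WorldView-2": 1.85, "WorldView-3": 1.24}

# Approximate tent dimensions quoted in the text, in metres.
TENT_LENGTH_M, TENT_WIDTH_M = 7.5, 4.0


def tent_footprint_px(resolution_m):
    """Return the (length, width) of a tent in pixels at a given resolution."""
    return TENT_LENGTH_M / resolution_m, TENT_WIDTH_M / resolution_m


for satellite, resolution in RESOLUTION_M.items():
    length_px, width_px = tent_footprint_px(resolution)
    print(f"{satellite}: roughly {length_px:.1f} x {width_px:.1f} px per tent")
```

A tent therefore spans only a handful of pixels in either dataset, which is why neighbouring tents so easily merge in a segmentation mask.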


Training Data. We developed DO-U-Net using a limited number of satellite 
images, obtained over a very limited time, with a nearly identical pixel scale. 
Each tent found in the training imagery has been marked with a polygon, using 
a custom Graphical User Interface (GUI) tool developed by Alcis. This has been 
done for a total of 6 images, covering an area of approximately 15 km² and
containing 5,178 tents. Incidentally, this makes our training dataset nearly 17 
times smaller than that used by Quinn et al. in their analysis. 

The second satellite dataset includes imagery of varying quality and resolu- 
tion, providing an opportunity to develop a scale-invariant version of our model. 
We used 3 additional training images, distinct from the original dataset, to train 
our scale-invariant DO-U-Net. These images contained 2,338 additional tents, in 
an area of around 130 km², giving a total of 7,516 tents in over 140 km².


2.2 Blood Smear Images 


We used blood smear images from the Acute Lymphoblastic Leukemia (ALL)
Image Database for Image Processing². These images were captured using an
optical laboratory microscope, with magnification ranging from 300 to 500, and
a Canon PowerShot G5 camera. We used the ALL_IDB1 dataset, comprising
108 images taken during September 2005 from both ALL and non-ALL patients.
An example blood smear image from an ALL patient can be seen in Fig. 2.


2 Provided by the Department of Information Technology at Università degli
Studi di Milano, https://homes.di.unimi.it/scotti/all/.


Fig. 1. Left: An IDP camp in Badghis, Afghanistan. NRC tents are clearly
visible due to their uniform size and light colour. Right: The manually
created ground-truth annotation for the image.


Fig. 2. Left: An image of a blood smear from an Acute Lymphoblastic Leukemia
(ALL) patient. Right: The manually created ground-truth annotation for the
image. The images also contain lymphocytes, which are not marked in the
training data.


Training Data. We selected 10 images from the ALL_IDB1 dataset to be
used as training data. These images are representative of the diverse nature of
the entire dataset, including the varying microscope magnifications and
backgrounds. Of the images used, 3 belong to ALL patients, with the remaining 7
images coming from non-ALL patients. Similarly to the IDP camp dataset, all
erythrocytes in the training data have been manually labelled with a polygon
using our custom GUI tool. In the images belonging to ALL patients, a total of
1,300 erythrocytes were marked. A further 3,060 erythrocytes were marked in
the images belonging to non-ALL patients, giving a total of 4,360 erythrocytes
in the training data.

The training data does not distinguish between typical erythrocytes and
those with any form of morphology; of the 4,360 erythrocytes, just 106 display
a clear morphology. The training data also does not contain any annotation
for leukocytes. Instead, our focus is on correctly segmenting and counting all
erythrocytes in the images.


3 Methodology 


Of late, several very advanced and powerful Computer Vision algorithms have
been developed, including the popular Mask R-CNN [5] and YOLO [9]
architectures. While their performance is undoubtedly impressive, they rely on
a large number of images to train their complex networks, as highlighted by
Quinn et al. [5]. More recently, many more examples of FCNs have been
developed, including SegNet [7], DeconvNet [13] and U-Net [14], with the last
emerging as arguably one of the most popular encoder-decoder based
architectures. Aimed at achieving a high degree of success even with sparse
training datasets and developed to tackle biological image segmentation
problems, U-Net is a clear starting point for our architecture.


3.1 U-Net 


The classical U-Net, as proposed by Ronneberger et al. [14], has revolutionised
the field of biomedical image segmentation. Similarly to other encoder-decoder
networks, U-Net is capable of producing highly precise segmentation masks.
Fig. 3. a: Sample segmentation mask in which some tent segmentations are seen to
overlap. b: Sobel filtered image. c: Local entropy filtered image. d: Otsu filtered image. 
e: Image with contour ellipses applied. f: Image with gradient morphology applied. g: 
Eroded image. h: Tophat filtered image. i: Blackhat filtered image. 


What differentiates it from Mask R-CNN, SegNet and other similar networks is 
its lack of reliance on large datasets [14]. This is achieved by the introduction of 
a large number of skip connections, which reintroduce some of the early encoder 
layers into the much deeper decoder layers. This greatly enriches the information 
received by the decoder part of the network, and hence reduces the overall size 
of the dataset required to train the network. 

We have deployed the original U-Net on our dataset of satellite images of 
IDP camps in Western Afghanistan. While we were able to produce segmenta- 
tion masks that very accurately marked the location of the tents, the segmenta- 
tion masks contained significant overlaps between tents, as seen in Fig. 3. This 
overlap prevents us from carrying out an automated count, despite using several 
post-processing techniques to minimise the impact of these overlaps. The most 
successful post-processing approaches are shown in Fig. 3. The issues encountered 
with the classical U-Net have motivated our modifications to the architecture, 
as described in this work. 


3.2 DO-U-Net 


Driven by the need to reduce overlap in segmentation masks, we modified the
U-Net architecture to produce dual outputs, thus developing the DO-U-Net.
The idea of a contour aware network was first demonstrated by the DCAN
architecture [15]. Based on a simple FCN, DCAN was trained to use the outer
contours of the areas of interest to guide the training of the segmentation
masks. This led to improved semantic and instance segmentation by the model,
which, in their case, looked at non-overlapping features in biomedical
imaging.

Fig. 4. The DO-U-Net architecture, showing two output layers that target the
segmentation and edge masks corresponding to the training images.


With the aim of counting closely co-located and overlapping objects, we
are predominantly interested in the correct detection of individual objects as
opposed to the exact precision of the segmentation mask itself. An examination
of the hidden convolutional layers of the classical U-Net showed that the
penultimate layer of the network extracts information about the edges of our
objects of interest, without any external stimulus. We introduce a secondary
output layer to the network, targeting a mask segmenting the edges of our
objects. By subtracting this “edge” mask from the original segmentation mask,
we can obtain a “reduced” mask containing only non-overlapping objects.
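
The effect of the subtraction can be illustrated on a toy example. The sketch below is not the authors' code: it builds a hypothetical 5 x 9 segmentation mask in which two touching rectangular objects merge into one blob, together with a one-pixel-wide edge mask, and shows that subtracting the latter leaves two separate interiors. The helper names and the mask sizes are assumptions made for illustration.

```python
HEIGHT, WIDTH = 5, 9


def box_outline(left, right):
    """One-pixel outline of a box spanning all rows and columns left..right."""
    return [
        [
            1 if left <= c <= right and (r in (0, HEIGHT - 1) or c in (left, right)) else 0
            for c in range(WIDTH)
        ]
        for r in range(HEIGHT)
    ]


# Two touching objects are segmented as a single filled blob.
segmentation = [[1] * WIDTH for _ in range(HEIGHT)]

# The edge mask is the union of both object outlines (they share column 4).
outline_a, outline_b = box_outline(0, 4), box_outline(4, 8)
edges = [
    [max(a, b) for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(outline_a, outline_b)
]

# "Reduced" mask: segmentation minus edges, with negative values clipped to 0.
reduced = [
    [max(s - e, 0) for s, e in zip(seg_row, edge_row)]
    for seg_row, edge_row in zip(segmentation, edges)
]

for row in reduced:
    print("".join("#" if value else "." for value in row))
```

The printed mask contains two 3 x 3 interiors separated by an empty column, so each object can be counted on its own.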

As our objective was to identify tents of fixed scale in our image dataset, we
were able to simplify the model considerably. This reduced the computational
requirements in training of the model, allowing not only for much faster
development and training but also opening the possibility of deploying the
algorithm on a dataset covering a large proportion of the total area of
Afghanistan, driven by our commercial requirements.


Architecture. Starting with the classical U-Net, we reduced the number of
convolutional layers and skip connections in the model. Simultaneously, we
minimised the complexity of the model by looking at smaller input regions of
the images, thus minimising the memory footprint of the model. We follow the
approach of Ronneberger et al. by using unpadded convolutions throughout the
network, resulting in a model with smaller output masks (100 × 100 px)
corresponding to a central region of a larger (188 × 188 px) input image
region. DO-U-Net uses two, independently trained, output layers of identical
size. Figure 4 shows our proposed DO-U-Net architecture. The model can also be
found in our public online repository³. Examples of the output edge and
segmentation masks, as well as the final “reduced” mask, can be seen in
Figs. 6 and 7. With the reduced memory footprint of our model, we can produce
a “reduced” segmentation mask for a single 100 × 100 px region in 3 ms using
TensorFlow 2.0 on a setup with an Intel i9-9820X CPU and a single NVIDIA RTX
2080 Ti GPU.


3 https://github.com/ToyahJade/DO-U-Net.


Fig. 5. Scale-Invariant DO-U-Net, redesigned to work with datasets containing
objects of variable scale.


Training. The large training images were divided such that no overlap exists
between the regions corresponding to the target masks, using zero-padding at
the image borders. We train our model against both segmentation and edge
masks. The edges of the mark-up polygons, annotated using our custom tool,
were used as the “edge” masks during training. Due to the difference in the
pixel size of tents and erythrocytes, the edges were taken to be 2 px and
4 px wide respectively in these case studies.
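
The tiling described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: it assumes that each 100 px target region sits at the centre of its 188 px input window (a 44 px margin on every side) and that any part of a window falling outside the image is filled with zeros. The function name is hypothetical.

```python
def tile_windows(height, width, out_size=100, in_size=188):
    """Yield (target_box, input_box) pairs that tile an image without overlap.

    Boxes are (top, left, bottom, right) in image coordinates. Input boxes may
    extend past the image borders; those pixels would be zero-padded.
    """
    margin = (in_size - out_size) // 2  # context lost to unpadded convolutions
    for top in range(0, height, out_size):
        for left in range(0, width, out_size):
            target_box = (top, left, top + out_size, left + out_size)
            input_box = (
                top - margin,
                left - margin,
                top + out_size + margin,
                left + out_size + margin,
            )
            yield target_box, input_box


# Example: a 250 x 300 px image needs a 3 x 3 grid of target regions.
windows = list(tile_windows(250, 300))
print(len(windows), "windows; first input box:", windows[0][1])
```

Target boxes never overlap, while neighbouring input boxes do, since each needs its own border of context.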

As our problem deals with segmentation masks covering only a small
proportion of the image (<1% in some satellite imagery), the choice of loss
function was a very important factor. We use the Focal Tversky Index, which is
suited for training with sparse positive labels compared to the overall area
of the training data [16]. Our best result, obtained using the Focal Tversky
loss function, gave an improvement of 5% in the Intersection-over-Union (IoU)
value compared to the Binary Cross-Entropy loss function, as used by
Ronneberger et al. [14]. We found the training to behave best when the
combined loss function for the model was heavily weighted toward the edge mask
segmentation. Here, we used a 10%/90% split for the object and edge mask
segmentation respectively.
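
A minimal sketch of such a loss is given below, written in plain Python on flattened masks so that the arithmetic is easy to follow. It is not the authors' TensorFlow code. The values of alpha, beta and gamma, and the convention of raising (1 - TI) to the power gamma, are assumptions chosen for illustration; only the 10%/90% weighting comes from the text.

```python
def tversky_index(y_true, y_pred, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky index on flattened masks with values in [0, 1].

    alpha weights false negatives and beta weights false positives
    (hypothetical values, not taken from the paper).
    """
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    return (tp + eps) / (tp + alpha * fn + beta * fp + eps)


def focal_tversky_loss(y_true, y_pred, gamma=0.75):
    """Focal Tversky loss, (1 - TI) ** gamma, with an assumed gamma."""
    return max(0.0, 1.0 - tversky_index(y_true, y_pred)) ** gamma


def combined_loss(seg_true, seg_pred, edge_true, edge_pred,
                  seg_weight=0.1, edge_weight=0.9):
    """Weighted sum of both output losses, using the 10%/90% split."""
    return (seg_weight * focal_tversky_loss(seg_true, seg_pred)
            + edge_weight * focal_tversky_loss(edge_true, edge_pred))


truth = [1, 1, 0, 0]
print(focal_tversky_loss(truth, [1, 1, 0, 0]))  # perfect prediction
print(focal_tversky_loss(truth, [0, 0, 1, 1]))  # completely wrong prediction
```

With this weighting, an error in the edge output costs nine times as much as the same error in the segmentation output.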


Counting. As the resulting “reduced” masks produced by our approach do 
not contain any overlaps, we can use simple counting techniques, relying on 
the detection of the bounding polygons for the objects of interest. We apply a 
threshold to remove all negative values from the image, which may occur due to 
the subtractions. We then use the Marching Squares Algorithm implemented as 
part of Python’s skimage.measure image analysis library [17]. 
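
The counting step can be sketched without external dependencies. Note that the sketch below does not use Marching Squares: as a stand-in, it thresholds the “reduced” mask and counts 4-connected regions with a simple flood fill, which gives the same count whenever each object forms one solid region. The threshold of 0.5 and the function name are assumptions.

```python
def count_objects(reduced_mask, threshold=0.5):
    """Count 4-connected regions above a threshold in a 2D mask.

    Negative values left over from the subtraction fall below the threshold
    and are therefore ignored.
    """
    height, width = len(reduced_mask), len(reduced_mask[0])
    foreground = [[value > threshold for value in row] for row in reduced_mask]
    seen = [[False] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if not foreground[r][c] or seen[r][c]:
                continue
            count += 1  # found a new object; flood-fill the rest of it
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and foreground[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


# Toy "reduced" mask with two objects and some negative residue.
toy_mask = [
    [-0.2, 0.9, 0.8, -0.1, 0.7],
    [0.0, 0.9, 0.1, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(count_objects(toy_mask))
```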


Scale-Invariant DO-U-Net. In addition to the simple DO-U-Net, we pro- 
pose a scale-invariant version of the architecture with an additional encoder and 
decoder block. Figure 5 shows the increased depth of the network as is required 
to capture the generalised model of our objects in the scale varying dataset. 
The addition of extra blocks resulted in a larger input layer of 380 × 380 px,
corresponding to a segmentation mask of 196 × 196 px.
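
The input and output sizes quoted for both variants are consistent with the usual size arithmetic of a U-Net built from unpadded convolutions. The sketch below assumes two 3 × 3 convolutions per block, 2 × 2 pooling and 2× upsampling, as in the original U-Net, with three pooling levels for DO-U-Net and four for the scale-invariant version. These level counts are inferred from the quoted sizes and are not stated in the text.

```python
def unet_output_size(input_size, levels, convs_per_block=2, kernel=3):
    """Output width of a U-Net with unpadded convolutions.

    Each block of unpadded convolutions trims convs_per_block * (kernel - 1)
    pixels; pooling halves the size and upsampling doubles it.
    """
    trim = convs_per_block * (kernel - 1)
    size = input_size
    for _ in range(levels):  # encoder: convolutions, then 2 x 2 pooling
        size -= trim
        if size <= 0 or size % 2:
            raise ValueError("input size is not compatible with this depth")
        size //= 2
    size -= trim  # bottleneck convolutions
    for _ in range(levels):  # decoder: 2x upsampling, then convolutions
        size = size * 2 - trim
    return size


print(unet_output_size(188, levels=3))  # DO-U-Net input size from the text
print(unet_output_size(380, levels=4))  # scale-invariant input size
print(unet_output_size(572, levels=4))  # original U-Net input size
```

Under these assumptions the function returns 100 and 196 for the two DO-U-Net variants, and reproduces the 572 to 388 px mapping of the original U-Net.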


4 Results 


4.1 IDP Tent Results 


Fig. 6. Left: Segmentation mask produced for NRC tents in a camp near
Qala’e’Naw. Centre: Edges mask produced for the same image. Right: The final
output mask.


Using our DO-U-Net architecture, we were able to achieve a very significant
improvement in the counting of IDP tents compared to the popularly used SVM
classifier available in ArcGIS. However, due to the manually intensive nature
of the ArcGIS approach⁴, we were only able to directly compare a single test
camp, located in the Qala’e’Naw region of the Badghis Province. This area
contains 921 tents as identified in the ground-truth masks. Using DO-U-Net, we
achieved a precision of 99.78% with a sensitivity of 99.46%. Using ArcGIS, we
find a precision of 99.86% and a significantly lower sensitivity of 79.37%.
Sensitivity, or the true positive rate, measures the probability of detecting
an object and is, therefore, the most important metric for us as we aim to
locate and count all tents in the region. The scale-invariant DO-U-Net
achieved a precision of 98.48% and a sensitivity of 98.37% on the same image.
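
For reference, the two metrics can be computed from object counts as sketched below. The true positive, false positive and false negative counts are not reported in the text; the values used here are hypothetical counts chosen because they are consistent with the 921 ground-truth tents and reproduce the quoted percentages.

```python
def precision(true_pos, false_pos):
    """Fraction of detected objects that are real objects."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of real objects that were detected (true positive rate)."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts for the 921-tent test camp (assumed, not from the paper).
methods = {
    "DO-U-Net": {"tp": 916, "fp": 2, "fn": 5},
    "ArcGIS": {"tp": 731, "fp": 1, "fn": 190},
}

for name, m in methods.items():
    print(f"{name}: precision {100 * precision(m['tp'], m['fp']):.2f}%, "
          f"sensitivity {100 * sensitivity(m['tp'], m['fn']):.2f}%")
```

Read this way, the two methods raise almost no false detections, and the gap between them lies almost entirely in the number of missed tents.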

We also apply DO-U-Net to a larger dataset of five images containing a total 
of 3,447 tents and find an average precision of 97.01% and an average sensitivity 
of 98.68%. Similarly, we tested the scale-invariant DO-U-Net using 10 images 
with varying properties and resolutions containing a total of 5,643 tents. Here, 
the average precision was reduced to 91.45%, and the average sensitivity dropped 
to 94.66%. This result is not surprising as, on inspection, we find that without
the scale constraints the resulting segmentation masks are contaminated with
other structures of similar properties to NRC tents. We also find that, without
scale constraints, NRC tents which are partially covered, e.g. with tarpaulin,
may be missed or only partially segmented. Our DO-U-Net and scale-invariant
DO-U-Net sensitivities of 98.68% and 94.66% respectively are very strong
results when compared to the existing literature.


4.2 Erythrocyte Results 


To validate the performance of DO-U-Net at counting erythrocytes, we use 3 ran- 
domly selected blood smear images from ALL patients and a further 5 selected 
images from non-ALL patients. While randomly selected, the images are repre- 
sentative of the entire ALL_IDB1 dataset. On a total of 2,775 erythrocytes, as 
found in these 8 validation images, DO-U-Net achieved an average precision of 
98.31% and an average sensitivity of 99.07%. 


4 Results found using ArcGIS methodology can be found at https://storymaps.arcgis. 
com/stories/d85e5cca27464d97ad4clbad3da7{140. 
Fig. 7. Left: Segmentation mask produced for a blood smear of an ALL patient. Centre:
Edges mask produced for the same image. Right: The final output mask. 




Fig. 8. Top: Blood smear images of overlapping cells. Bottom: Segmentation masks 
produced by DO-U-Net. Left: An overlapped cell is counted twice when the “edges” 
from neighbouring cells overlap and break up the cell. Centre: A cell is missed due to 
an incomplete edge mask. Right: An uncertainty in the order of the cell overlap leads 
to the intersect between two cells being counted as an additional cell. 


Whilst our proposed DO-U-Net is extremely effective at producing image 
and edge segmentation masks, as demonstrated in Fig. 7, we do note that the
obtained erythrocyte count may not always match the near-perfect segmenta- 
tion. Figure 8 shows examples of the three most common issues found to occur 
in the final “reduced” masks. These mistakes arise largely due to the translucent 
nature of erythrocytes and the difficulty in differentiating between a cell which 
is overlapping another and a cell which is overlapped. While these cases are rare, 
this demonstrates that further improvements can be made to the architecture. 


4.3 Future Work 


Our current model has been designed to segment only one type of object, which 
is a clear limitation of our solution. As an example, the blood smear images from 
the ALL_IDB1 dataset contain normal erythrocytes as well as two clear types
of morphology: burr cells and dacryocytes. These morphologies may be signs of 
disease in patients, though burr cells are common artefacts, especially known to 
occur when the blood sample is aged. It is therefore important to not only count 
all erythrocytes, but to also differentiate between their various morphologies. 
Table 1. Summary of results for DO-U-Net, when tested on our two satellite imagery
datasets and the ALL_IDB1 dataset.


Dataset                     | Number of images | Number of objects | Average precision | Average sensitivity
IDP Camps (Fixed Scale)     | 5                | 3,447             | 97.01%            | 98.69%
IDP Camps (Scale-Invariant) | 10               | 5,643             | 91.45%            | 94.66%
ALL_IDB1                    | 8                | 2,775             | 98.31%            | 99.07%


While our general theory can be applied to identifying different types of object, 
further modifications to our proposed DO-U-Net would be required. 


5 Conclusion 


We have proposed a new approach to segmenting and counting closely co-located 
and overlapping objects in complex image datasets. For this, we developed DO- 
U-Net: a modified U-Net based architecture, designed to produce both a seg- 
mentation and an “edge” mask. By subtracting the latter from the former, we 
can locate and spatially separate objects of interest before automatically count- 
ing them. Our methodology was successful on both of our case studies: locating 
and counting IDP tents in satellite imagery, and the segmentation and count- 
ing of erythrocytes in blood smear images. In the first case study, DO-U-Net 
increased our sensitivity by approximately 20% compared to a popular ArcGIS 
based solution, achieving an average sensitivity of 98.69% for a dataset of fixed 
spatial resolution. Our network went on to achieve a precision of 91.45% and a
sensitivity of 94.66% on a set of satellite images with varying resolution and
colour profiles. This is an impressive result when compared to Quinn et al.,
who achieved a precision of 78%. We also found DO-U-Net to be extremely
successful at segmenting and counting erythrocytes in blood smear images,
achieving a
sensitivity of 99.07% for our test dataset. This is an improvement of 6% over the 
results found by Tran et al. who used the same training dataset, and 3% over 
Alam and Islam who used a comparable set of images, giving us a near-perfect 
sensitivity when counting erythrocytes. The results are summarised in Table 1. 
Abstract. Anorexia Nervosa (AN) is a serious mental disorder that has
been shown to be traceable on social media through the analysis of users’
written posts. Here we present an approach to generate word embeddings
enhanced for a classification task dedicated to the detection of Reddit
users with AN. Our method extends Word2vec’s objective function in
order to bring domain-specific and semantically related words closer
together. The approach is evaluated through the calculation of an average
similarity measure, and by using the generated embeddings as features for
the AN screening task. The results show that our method outperforms
fine-tuned pre-learned word embeddings, related methods for generating
domain-adapted embeddings, and representations learned on the training set
using Word2vec. This method can potentially be applied and evaluated on
similar tasks that can be formalized as document categorization problems.
Regarding our use case, we believe that this approach can contribute to the
development of proper automated detection tools to alert and assist
clinicians.


Keywords: Social media · Eating disorders · Word embeddings ·
Anorexia Nervosa · Representation learning


1 Introduction 


We present models to identify users with AN based on the texts they post
on social media. Word embeddings previously learned on a large corpus have
provided good results on predictive tasks [3]. However, in the case of writings
generated by users living with a mental disorder such as AN, we observe specific
vocabulary exclusively related to the topic. Terms such as “cw”, used to refer
to the current weight of a person, or “ow”, referring to the objective weight,


This work was supported by the University of Lyon - IDEXLYON and the Spanish Min- 
istry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence 
Program (MDM-2015-0502). 

© The Author(s) 2020 


M. R. Berthold et al. (Eds.): IDA 2020, LNCS 12080, pp. 404-417, 2020. 
https://doi.org/10.1007/978-3-030-44584-3_32


Anorexia Risk Assessment on Social Media 405 


are elements that are not easily found in large yet general collections extracted 
from Wikipedia, social media and news websites. Therefore, using pre-learned 
embeddings may not be the most suitable approach for the task. 

We propose a method based on Dict2vec [15] to generate word embeddings
enhanced for our task domain. The main contributions of our work are the
following: (1) a method that modifies Dict2vec [15] in order to generate word
embeddings enhanced for our classification task, and which can be applied to
similar tasks that can be formulated as document categorization problems;
(2) different ways to improve the performance of the embeddings generated by
our method, corresponding to four embedding variants; and (3) a set of
experiments to evaluate the performance of our generated embeddings in
comparison to pre-learned embeddings and other domain adaptation methods.


2 Related Work 


In previous work related to detection of mental disorders [8], documents were 
represented using bag of words (BoW) models, which involve representing words 
in terms of their frequencies. As these models do not consider contextual infor- 
mation or relations between the terms, other models have been proposed based 
on word embeddings [3]. These representations are generated considering the dis- 
tributional hypothesis, which assumes that words appearing in similar contexts 
are related, and therefore should have close representations [11,13]. 

Embedding models allow words from a large corpus to be encoded as vectors 
in a high-dimensional space. The vectors are defined by taking into account the 
context in which the words appear in the corpus in such a way that two words 
having the same neighborhood should be close in the vector space. 

Among the methods used for generating word embeddings we find 
Word2vec [11], which generates a vector for each word in the corpus consid- 
ering it as an atomic entity. To build the embeddings, Word2vec defines two 
approaches: one known as continuous bag of words (CBOW) that uses the con- 
text to predict a target word; and another one called skip-gram, which uses a 
word to predict a target context. Another method is fastText [2], which takes into
account the morphology of words, having each word represented as a bag of char- 
acter n-grams for training. There is also GloVe [13], which proposes a weighted 
least squares model that does the training on global word-word co-occurrence 
counts. 
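
The difference between the two Word2vec approaches lies in how training examples are drawn from a context window. The sketch below is an illustration only, with a hypothetical function name and a toy sentence; it builds the examples and does not train a model.

```python
def training_examples(tokens, window=2):
    """Build skip-gram and CBOW training examples from a token list.

    Skip-gram examples are (target, context_word) pairs: the target word is
    used to predict each surrounding word. CBOW examples are
    (context_words, target) pairs: the surrounding words predict the target.
    """
    skipgram, cbow = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        skipgram.extend((target, word) for word in context)
        cbow.append((context, target))
    return skipgram, cbow


skipgram_pairs, cbow_examples = training_examples(
    ["the", "cat", "sat", "down"], window=1
)
print(skipgram_pairs)
print(cbow_examples)
```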

In contrast to the previous methods, we can also mention recent methods
like Embeddings from Language Models (ELMo) [14] and Bidirectional Encoder
Representations from Transformers (BERT) [6], which generate representations
that are aware of the context in which they are used. These approaches are
useful for tasks where polysemic terms are relevant, and when there are enough
sentences to learn these from the context. Regarding our use case, we observe
that the vocabulary used by users with AN is very specific and contains almost no
polysemic terms, which is why these methods are not addressed in our evaluation
framework.
All the methods already mentioned are generally trained over large general
purpose corpora. However, for certain domain specific classification tasks we 
have to work with small corpora. This is the case of mental disorders screening 
tasks given that the annotation phase is expensive, and requires the intervention 
of specialists. There are some methods that address this issue by either enhanc- 
ing the embeddings learned over small corpora with external information, or 
adapting embeddings learned on large corpora to the task domain. 

Among the enhancement methods we find the work of Zhang et al. [17]. They
made use of word embeddings learned in different health related domains to
recognize symptoms in psychiatry. They designed approaches to combine data
of the source and target to generate word embeddings, which are considered in
our experimental results.

Kuang et al. [9] propose learning weights based on the words’ relative
importance for the classification task (predictive terms). This method
proposes weighting words according to their χ² [12] statistics to represent
the context. However, this method differs from ours, as we generate our
embeddings through a different approach, which takes into account the context
terms, introduces new domain related vocabulary, considers the predictive
terms to be equally important, and moves apart the vectors of terms that are
not predictive for the main target class.

Faruqui et al. [7] present an alternative, known as a retrofitting method,
which makes use of relational information from semantic lexicons to improve
pre-built word vectors. The main disadvantages are that no representations of
new external terms can be introduced to the enhanced embeddings, and that,
although related embeddings are brought closer, the embeddings of terms that
should not be related (task-wise) cannot be moved apart from each other. In
our experimental setup, this method is used to define a baseline and to
enhance the embeddings generated through our approach.

Our proposal is based on Dict2vec [15], which is an extension of the Word2vec
approach. Dict2vec uses the lexical dictionary definitions of words in order to
enrich the semantics of the embeddings generated. This approach has proved
to perform well on small corpora because, in addition to the context defined by
Word2vec, it introduces (1) a positive sampling, which moves closer the vectors
of words co-occurring in their mutual dictionary definitions, and (2) a
controlled negative sampling, which prevents the vectors of words that appear
in the definition of others from being moved apart, as the authors assume that
all the words in the definition of a term from a dictionary are semantically
related to the word they define.


3 Method Proposed 


Our method generates word embeddings enhanced for a classification task dedi- 
cated to the detection of users with AN over a small size corpus. In this context, 
users are represented by documents that contain their writings concatenated, 
and that are labeled as anorexic (positive) or control (negative) cases. These 
labels are known as the classes to predict for our task. 
Our method is based on Dict2vec’s general idea [15]. We extend the Word2vec
model with both a positive and a negative component, but our method differs 
from Dict2vec because both components are designed to learn vectors for a 
specific classification task. Within the word embeddings context, we assume that 
word-level n-grams’ vectors, which are predictive for a class, should be placed 
close to each other given their relation with the class to be predicted. Therefore 
we first define sets of what we call predictive pairs for each class, and use them 
later for our learning approach. 


3.1 Predictive Pairs Definition 


Prior to learning our embeddings, we use χ² [12] to identify the predictive
n-grams. This is a method commonly used for feature reduction, capable of
identifying the most predictive features, in this case terms, for a
classification task.
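
As an illustration of this kind of scoring, the sketch below computes the classical chi-squared statistic of a 2 × 2 contingency table relating the presence of a term to the class of a document. This is an assumption about the form of the statistic: the implementation referenced in [12] may compute it differently, for instance on term frequencies rather than on document presence.

```python
def chi2_term_score(pos_with, neg_with, pos_without, neg_without):
    """Chi-squared statistic of a 2 x 2 term/class contingency table.

    pos_with / neg_with: positive / negative documents containing the term.
    pos_without / neg_without: positive / negative documents without it.
    """
    total = pos_with + neg_with + pos_without + neg_without
    numerator = total * (pos_with * neg_without - neg_with * pos_without) ** 2
    denominator = (
        (pos_with + pos_without)      # all positive documents
        * (neg_with + neg_without)    # all negative documents
        * (pos_with + neg_with)       # all documents with the term
        * (pos_without + neg_without) # all documents without the term
    )
    return numerator / denominator if denominator else 0.0


print(chi2_term_score(10, 10, 10, 10))  # term independent of the class
print(chi2_term_score(10, 0, 0, 10))    # term perfectly tied to one class
```

The score is symmetric in the two classes: it measures how strongly a term is associated with the class label, not which class it favours.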

Based on the χ² scores distribution, we obtain the n terms with the highest
scores (most predictive terms) for each of the classes to predict (positive
and negative). Later, we identify the most predictive term for the positive
class, denoted as $t_1$ or pivot term. Depending on the class for which a term
is predictive, two types of predictive pairs are defined, so that every time a
predictive word is found, it will be moved closer to or farther from $t_1$.
These predictive pair types are: (1) positive predictive pairs, where each
predictive term for the positive class is paired with the term $t_1$ in order
to get its vector representation closer to $t_1$; and (2) negative predictive
pairs, where each term predictive for the negative class is also paired with
$t_1$, but with the goal of moving it apart from $t_1$.

In order to define the positive predictive terms for our use case, we
consider: the predictive terms defined by the χ² method, AN related vocabulary
(domain-specific) and the k most similar words to $t_1$ obtained from
pre-learned embeddings, according to the cosine similarity. In this way,
information coming from external sources that are closely related to the task
can be introduced to the training corpus. The terms that were not part of the
corpus were appended to it, providing a way to add new vocabulary of semantic
significance to the task.
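
Retrieving the k most similar words to the pivot term can be sketched as below. The embedding table, its two-dimensional vectors and the word labels are toy placeholders rather than real pre-learned embeddings, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(pivot, embeddings, k):
    """Return the k words whose vectors are closest to the pivot's vector."""
    scores = [
        (word, cosine_similarity(embeddings[pivot], vector))
        for word, vector in embeddings.items()
        if word != pivot
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in scores[:k]]


# Toy embedding table (hypothetical values).
toy_embeddings = {
    "pivot": [1.0, 0.0],
    "near": [0.9, 0.1],
    "middle": [0.5, 0.5],
    "far": [-1.0, 0.0],
}
print(most_similar("pivot", toy_embeddings, k=2))
```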

Regarding the negative predictive terms, no further elements are considered
aside from the χ² predictive terms of the negative class since, for our use
case and similar tasks, control cases do not seem to share a vocabulary
strictly related to a given topic. In other words, and as observed for the
anorexia detection use case, control users are characterized by their
discussions on topics unrelated to anorexia.

For the χ² method, when having a binary task, the resulting predictive
features are the same for both classes (positive and negative). Therefore, we
obtained the top n most predictive terms based on the distribution of the χ²
scores for all the terms. We then examined the number of documents containing
the selected n terms according to their class (anorexia or control). Given a
term t, we calculated the number of documents belonging to the positive class
(anorexia) containing t, denoted as PCC, and the number of documents belonging
to the negative class (control) containing t, denoted as NCC. Then, for t, we
calculate the ratio of each count to the total number of documents belonging
to the corresponding class: the total number of positive documents (TPD) and
the total number of negative documents (TND), thus obtaining a positive class
count ratio (PCCR) and a negative class count ratio (NCCR).

For a term to be part of the set of positive predictive terms, its PCCR value
has to be higher than its NCCR value, and the opposite applies for the terms
that belong to the set of negative predictive pairs. The positive and negative
class count ratios are defined in Eqs. 1a and 1b as:


$$\mathrm{PCCR}(t) = \frac{\mathrm{PCC}(t)}{\mathrm{TPD}} \tag{1a}$$

$$\mathrm{NCCR}(t) = \frac{\mathrm{NCC}(t)}{\mathrm{TND}} \tag{1b}$$
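
The assignment of terms to classes and the construction of predictive pairs can be sketched as follows. The term names, document counts and choice of pivot are hypothetical, and leaving out terms whose two ratios are equal is an assumption, since the text does not say how ties are handled.

```python
def split_predictive_terms(doc_counts, total_pos_docs, total_neg_docs):
    """Split terms into positive and negative predictive terms.

    doc_counts maps each term to (PCC, NCC): the number of positive and
    negative documents containing it. A term is positive when
    PCCR = PCC / TPD exceeds NCCR = NCC / TND, and negative in the
    opposite case.
    """
    positive, negative = [], []
    for term, (pcc, ncc) in doc_counts.items():
        pccr = pcc / total_pos_docs
        nccr = ncc / total_neg_docs
        if pccr > nccr:
            positive.append(term)
        elif nccr > pccr:
            negative.append(term)
    return positive, negative


def predictive_pairs(terms, pivot):
    """Pair every term with the pivot term."""
    return [(pivot, term) for term in terms if term != pivot]


# Hypothetical counts for 50 positive and 100 negative documents.
counts = {"term_a": (30, 5), "term_b": (2, 40), "term_c": (10, 10)}
positive_terms, negative_terms = split_predictive_terms(counts, 50, 100)

pivot_term = "term_a"  # assumed to be the most predictive positive term
print(predictive_pairs(positive_terms, pivot_term))
print(predictive_pairs(negative_terms, pivot_term))
```

Note that "term_c" appears in the same number of documents of each class, yet counts as a positive term because the positive class is smaller. This is the reason for comparing ratios rather than raw counts.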


3.2 Learning Embeddings 


Once the predictive pairs are defined, the objective function for a target term
$w_t$ (Eq. 2) is defined by the addition of a positive sampling cost (Eq. 3)
and a negative sampling cost (Eq. 4a) to Word2vec’s usual target-context pair
cost, given by $\ell(v_t \cdot v_c)$, where $\ell$ represents the logistic loss
function, and $v_t$ and $v_c$ are the vectors associated to $w_t$ and $w_c$
respectively.


$$J(w_t, w_c) = \ell(v_t \cdot v_c) + J_{pos}(w_t) + J_{neg}(w_t) \tag{2}$$


Unlike Dict2vec, $J_{pos}$ is computed for each target term, where $P(w_t)$ is
the set of all the words that form a positive predictive pair with the word
$w_t$, and $v_t$ and $v_i$ are the vectors associated to $w_t$ and $w_i$
respectively. $\beta_P$ is a weight that defines the importance of the positive
predictive pairs during the learning phase. Also, as an aspect that differs
from Dict2vec, the cost given by the predictive pairs is normalized by the
size of the predictive pairs set, $|P(w_t)|$, considering that all the terms
from the predictive pairs set of $w_t$ are taken into account for the
calculations. Therefore, when $t_1$ is found, the impact of trying to move it
closer to a large number of terms is reduced, and it remains a pivot element
to which other predictive terms get close:


$$J_{pos}(w_t) = \beta_P \sum_{w_i \in P(w_t)} \frac{\ell(v_t \cdot v_i)}{|P(w_t)|} \tag{3}$$


For the negative sampling, we modify Dict2vec's approach. We not only make
sure that the vectors of the terms forming a positive predictive pair with w_t
are not pushed apart from it, but we also define a set of words that are
predictive for the negative class and define a cost given by the negative
predictive pairs. In this case, as explained before, the main goal is to push
these terms apart from t_1, so this cost is added to the negative random
sampling cost J_{neg_r} (Eq. 4b), as detailed in Eq. 4a.


\[ J_{neg}(w_t) = J_{neg_r}(w_t) + \beta_N \sum_{w_j \in N(w_t)} \ell(-v_t \cdot v_j) \tag{4a} \]

\[ J_{neg_r}(w_t) = \sum_{\substack{w_i \in F(w_t) \\ w_i \notin P(w_t)}} \ell(-v_t \cdot v_i) \tag{4b} \]

The negative sampling cost considers, as in Word2vec, a set F(w_t) of k words
selected randomly from the vocabulary. These words are pushed apart from w_t,
as they are unlikely to be semantically related to it. Following Dict2vec's
approach, we also make sure that no term belonging to the set of positive
predictive pairs of w_t ends up being pushed apart. In addition, we add another
negative sampling cost, which corresponds to the cost of pushing the most
predictive terms of the negative class apart from t_1. In this case, N(w_t)
represents the set of all the words that form a negative predictive pair with
the word w_t, and β_N is a weight that defines the importance of the negative
predictive pairs during the learning phase.

The global objective function (Eq. 5) is given by the sum of every pair's cost
across the whole corpus:


\[ J = \sum_{t=1}^{C} \sum_{c=-n}^{n} J(w_t, w_{t+c}) \tag{5} \]


where C is the corpus size and n represents the size of the window.
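
To make the structure of the per-pair cost more concrete, the following Python sketch evaluates it for a toy embedding. It is only an illustration under stated assumptions: ℓ is taken to be the logistic loss log(1 + exp(-x)) applied to the dot product of two vectors, the dot product is negated for terms that should be pushed apart (the usual negative sampling convention), and the function names and toy vectors are hypothetical. It computes the cost only; it does not perform the gradient updates of the actual training procedure.

```python
import math


def logistic_loss(x):
    """Logistic loss log(1 + exp(-x)); it is small when x is large."""
    return math.log1p(math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair_cost(emb, w_t, w_c, P, N, F, beta_p, beta_n):
    """Cost of one (target, context) pair.

    P: words forming a positive predictive pair with w_t
    N: words forming a negative predictive pair with w_t
    F: randomly sampled negative words for w_t
    """
    v_t = emb[w_t]
    # Word2vec target-context term.
    cost = logistic_loss(dot(v_t, emb[w_c]))
    # Positive predictive pairs, normalized by |P(w_t)|.
    if P:
        cost += beta_p * sum(logistic_loss(dot(v_t, emb[w])) for w in P) / len(P)
    # Random negative samples, skipping words that belong to P(w_t).
    cost += sum(logistic_loss(-dot(v_t, emb[w])) for w in F if w not in P)
    # Negative predictive pairs (sign convention assumed, see lead-in).
    cost += beta_n * sum(logistic_loss(-dot(v_t, emb[w])) for w in N)
    return cost


# Toy two-dimensional vectors (illustrative values only).
emb = {
    "anorexia": [1.0, 0.0],
    "underweight": [0.9, 0.1],
    "game": [-0.8, 0.2],
    "eat": [0.5, 0.5],
    "sky": [0.0, 1.0],
}
P, N, F = ["underweight"], ["game"], ["sky", "underweight"]

baseline_cost = pair_cost(emb, "anorexia", "eat", P, N, F, beta_p=0.0, beta_n=0.0)
enhanced_cost = pair_cost(emb, "anorexia", "eat", P, N, F, beta_p=1.0, beta_n=1.0)
```

With both weights set to 0 the cost reduces to the Word2vec terms, which corresponds to the baseline configuration; positive weights add the two predictive pair terms on top of it.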


3.3 Enhanced Embeddings Variations 


Given a pre-learned embedding that associates a word w with a pre-learned
representation v_pl, and an enhanced embedding v obtained through our approach
for w with the same length m as v_pl, we generate variations of our embeddings
based on existing enhancement methods. First, we denote the embeddings
generated exclusively by our approach (predictive pairs) as Variation 0; v is
an instance of the representation of w for this variation.

For the next variations, we address ways to combine the vectors of pre-learned
embeddings (i.e., v_pl) with those of our enhanced embeddings (i.e., v). For
Variation 1, we concatenate both representations, v_pl + v, obtaining a vector
of 2m dimensions [16]. Variation 2 involves concatenating both representations
and applying truncated SVD as a dimensionality reduction method to obtain a new
representation given by SVD(v_pl + v). Variation 3 uses the values of the
pre-learned vector v_pl as starting weights to generate a representation using
our learning approach. This variation is inspired by a popular transfer
learning method that was successfully applied to similar tasks [5]. For these
variations (1-3), we take into account the intersection between the
vocabularies of both embedding types (pre-learned and Variation 0). Finally,
Variation 4 applies Faruqui's retrofitting method [7] to the embeddings of
Variation 0.
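
As an illustration of Variation 1 and of the vocabulary intersection, the concatenation step can be sketched in a few lines of Python. The dictionaries, their toy values, and the function name are hypothetical; embeddings are assumed to be stored as word-to-vector mappings.

```python
def concatenate_embeddings(pre_learned, enhanced):
    """Concatenate v_pl and v for every word present in both vocabularies."""
    shared = pre_learned.keys() & enhanced.keys()
    return {w: list(pre_learned[w]) + list(enhanced[w]) for w in sorted(shared)}


# Toy embeddings with m = 2 (illustrative values only).
pre_learned = {"anorexia": [0.1, 0.2], "game": [0.3, 0.4], "tweet": [0.5, 0.6]}
enhanced = {"anorexia": [0.7, 0.8], "game": [0.9, 1.0], "macros": [1.1, 1.2]}

combined = concatenate_embeddings(pre_learned, enhanced)
```

Only the words shared by both vocabularies are kept, and each resulting vector has 2m dimensions. For Variation 2, the concatenated vectors could then be stacked into a matrix and reduced with a truncated SVD implementation such as scikit-learn's TruncatedSVD.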
4 Evaluation Framework


4.1 Data Set Description 


We used a Reddit data set [10] that consists of posts of users labeled as
anorexic and control cases. This data set was defined in the context of an
early risk detection shared task, and the training and test sets were provided
by the organizers of the eRisk task.¹ Table 1 provides a description of the
training and test set statistics. Given the incidence of Anorexia Nervosa, both
sets contain a small yet significant number of AN cases compared to the control
cases.


Table 1. Collection description, as given in [10].


                       | Train               | Test
                       | Anorexia | Control  | Anorexia | Control
Users count            | 20       | 132      | 41       | 279
Writings count         | 7,452    | 77,514   | 17,422   | 151,364
Avg. writings count    | 372.6    | 587.2    | 424.9    | 542.5
Avg. words per writing | 41.2     | 20.9     | 35.7     | 20.9


4.2 Embeddings Generation 


The training corpus used to generate the embeddings, named the anorexia corpus,
consisted of the concatenation of all the writings from all the training users.
A set of stop-words was removed. This resulted in a training corpus with a size
of 1,267,208 tokens and a vocabulary size of 87,197 tokens. In order to
consider the bigrams defined by our predictive pairs, the words belonging to a
bigram were paired and formatted as if they were a single term.

For the generation of predictive pairs with χ², each user is an instance
represented by a document composed of all the user's posts concatenated. χ² is
applied over the training set, considering the users' classes (anorexic or
control) as the possible categories for the documents. The process described in
Sect. 3.1 is followed in order to obtain a list of 854 positive (anorexia) and
15 negative (control) predictive terms. Some of these terms can be seen in
Table 2, which displays the top 15 most predictive terms for both classes.
Anorexia itself turned out to be the term with the highest χ² score, denoted as
t_1 in Sect. 3.

The anorexia domain-related terms from [1] were added as the topic-related
vocabulary, and the top 20 words with the highest similarity to anorexia in a
set of pre-learned GloVe embeddings [13] were also paired to it to define the
predictive pairs sets. The pre-learned GloVe vectors considered are the
100-dimensional representations learned over 2B tweets with 27B tokens and a
vocabulary of 1.2M terms.


1 eRisk task: https://early.irlab.org/2018/index.html.

Table 2. List of some of the most predictive terms for each class.


Positive terms (Anorexia class)                        | Negative terms (Control class)
anorexia           | diagnosed        | binges         | war     | sky     | song
anorexic           | macros           | calories don't | bro     | plot    | master
meal plan          | cal              | relapsed       | Trump   | game    | Russian
underweight        | weight gain      | restriction    | players | Earth   | video
eating disorder(s) | anorexia nervosa | caffeine       | gold    | America | trailer


The term anorexia was paired to 901 unique terms and, likewise, each of these
terms was paired to anorexia. The same approach was followed for the negative
predictive terms (15), which were also paired with anorexia. An instance of a
positive predictive pair is (anorexia, underweight), whereas an instance of a
negative predictive pair is (anorexia, game). Since our approach extends
Word2vec, we learned the embeddings with a window size of 5, 5 random negative
pairs for negative sampling, one thread/worker, and 5 epochs.


4.3 Evaluation Based on the Average Cosine Similarity 


This evaluation is done over the embeddings generated through Variation 0 over
the anorexia corpus. It averages the cosine similarities (sim) between t_1 and
all the terms that were defined either as its p positive predictive pairs,
obtaining a positive score denoted as PS in Eq. 6a, or as its n negative
predictive pairs, obtaining a negative score denoted as NS in Eq. 6b. In these
equations, v_a represents the vector of the term anorexia; v_{PPT_i} represents
the vector of the positive predictive term (PPT) i belonging to the set of
positive predictive pairs of anorexia, of size p; and v_{NPT_i} represents the
vector of the negative predictive term (NPT) i belonging to the set of negative
predictive pairs of anorexia, of size n:


\[ PS(a) = \frac{\sum_{i=1}^{p} \mathrm{sim}(v_a, v_{PPT_i})}{p} \tag{6a} \]
\[ NS(a) = \frac{\sum_{i=1}^{n} \mathrm{sim}(v_a, v_{NPT_i})}{n} \tag{6b} \]

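
The two scores are plain averages of cosine similarities, which the following Python sketch illustrates. The toy vectors and the function names are assumptions made for the example; they do not come from the trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_similarity(emb, pivot, terms):
    """Average cosine similarity between the pivot term and a list of terms."""
    return sum(cosine(emb[pivot], emb[t]) for t in terms) / len(terms)


# Toy two-dimensional vectors (illustrative values only).
emb = {
    "anorexia": [1.0, 0.0],
    "underweight": [1.0, 1.0],
    "relapsed": [2.0, 0.0],
    "game": [0.0, 3.0],
    "trailer": [-1.0, 0.0],
}

ps = average_similarity(emb, "anorexia", ["underweight", "relapsed"])  # PS
ns = average_similarity(emb, "anorexia", ["game", "trailer"])          # NS
```

A representation that behaves as intended yields a PS value well above the NS value, as in this toy case.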
We designed our experiments using PS and NS in order to analyze three main
aspects: (1) we verify that, through the application of our method, the
predictive terms for the positive class are closer to the pivot term
representation, and that the predictive terms for the negative class are moved
away from it; (2) we evaluate the impact of using different values of the
parameters β_P and β_N to obtain the best representations, where PS has the
highest possible value while NS is kept as low as possible; and (3) we compare
our generation method with Word2vec as a baseline, since this is the case in
which our predictive pairs are not considered (β_P = 0 and β_N = 0). We expect
our embeddings to obtain higher values for PS and lower values for NS in
comparison to the baseline.

Results. Table 3 first shows the values for PS and NS obtained by what we
consider our baseline, Word2vec (β_P = 0 and β_N = 0), and then the values
obtained by embedding models generated using our approach (Variation 0), with
different values that are set equal for the parameters β_P and β_N, as they
proved to provide the best results for PS and NS. We also evaluated
individually the effects of varying exclusively the values of β_P, leaving
β_N = 0, and then the effects of varying only the values of β_N, with β_P = 0.
On the last row of the table we show a model corresponding to the combination
of the parameters with the best individual performance (β_P = 75 and
β_N = 25).

After applying our approach, the value of PS becomes greater than NS for most
of our generated models, meaning that we were able to obtain a representation
where the positive predictive terms are closer to the pivot term anorexia and
the negative predictive terms are farther from it. We can also observe that the
averages change significantly depending on the values of the parameters β_P and
β_N, and for this case the best results according to PS are obtained when
β_P = 50 and β_N = 50. Finally, when we compare our scores with Word2vec, we
can observe that, after applying our method, we can obtain representations
where the values of PS and NS are respectively higher and lower than the ones
obtained by the baseline model.


Table 3. Positive scores (PS) and negative scores (NS) for Variation 0.
Different values for β_P and β_N are tested.

Values for β_P and β_N      | Positive score (PS) | Negative score (NS)
β_P = 0, β_N = 0 (baseline) | 0.8861              | 0.8956
β_P = 0.25, β_N = 0.25      | 0.7878              | 0.7424
β_P = 0.5, β_N = 0.5        | 0.7916              | 0.5158
β_P = 1, β_N = 1            | 0.7996              | 0.5879
β_P = 10, β_N = 10          | 0.8495              | 0.4733
β_P = 50, β_N = 50          | 0.9479              | 0.6009
β_P = 100, β_N = 100        | 0.9325              | 0.6440


4.4 Evaluation Based on Visualization 


We focus on the comparison of embeddings generated using Word2vec (baseline),
Variation 0 of our enhanced embeddings, and Variation 4. In order to plot the
vectors of the generated embeddings (see Fig. 1), we performed dimensionality
reduction, from the original 200 dimensions to 2, through Principal Component
Analysis (PCA) over the vectors of the terms in Table 2 for the embeddings
generated with these three representations. We focused on the embeddings
representing the positive and negative predictive terms. For the embeddings
resulting from our method (Variation 0), we selected β_P = 50 and β_N = 50 as
parameter values.

Fig. 1. Predictive terms sample represented in two dimensions after PCA was
applied to their embeddings as a dimensionality reduction method. From left to
right, each plot shows the vector representation of the predictive terms
according to the embeddings obtained through (1) Word2vec (baseline), (2)
Variation 0, and (3) Variation 4.


The positive predictive term representations are closer after applying our
method (Variation 0), and the negative predictive terms are displayed farther
away, in comparison to the baseline. The last plot displays the terms for the
embeddings generated through Variation 4. For this case, given the input format
of the retrofitting method, anorexia was linked with all the remaining
predictive terms of the anorexia class (901), and likewise, each of these
predictive terms was linked to the term anorexia. Notice that the retrofitting
approach converges to changes in the Euclidean distance of adjacent vertices,
whereas the closeness between terms in our approach is given by the cosine
distance.


4.5 Evaluation Based on the Predictive Task 


In order to test our generated embeddings on the classification task dedicated
to AN screening, we conduct a series of experiments to compare our method with
related approaches. We define 5 baselines for our task. The first one is a BoW
model based on word-level unigrams and bigrams (Baseline 1); this model is kept
mainly as a reference, since our main focus is to evaluate our approach against
other word-embedding-based models. We create a second model using GloVe's
pre-learned embeddings (Baseline 2), and a third model that uses word
embeddings learned on the training set with the Word2vec approach (Baseline 3).
We evaluate a fourth approach (Baseline 4) given by the enhancement of the
Baseline 3 embeddings with the retrofitting method of Faruqui et al. [7].
Baseline 5 applies the same retrofitting method to GloVe's pre-learned
embeddings, as we expected that a domain adaptation of the embeddings learned
on an external source could be achieved this way.


Predictive Models Generation. To create our predictive models, again, each
user is an instance represented by their writings (see Sect. 4.2). For
Baseline 1 we did a tf-idf vectorization of the users' documents using the
TfidfVectorizer provided by the Scikit-learn Python library, with a stop-words
list and the removal of the n-grams that appeared in fewer than 5 documents.
The representation of each user through embeddings was given by the
aggregation of the vector representations of the words in the concatenated
texts of the user, normalized by the size (word count) of the document. Then,
an L2 normalization was applied to all the instances.
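
The user representation described above can be sketched as follows. This is a minimal illustration that assumes "aggregation" means summing the word vectors, that words missing from the embedding vocabulary contribute nothing, and that the word count includes every token of the document; these choices and the function name are assumptions, not details confirmed by the text.

```python
import math


def user_vector(tokens, emb):
    """Sum the word vectors of a user's document, divide by the word count,
    and scale the result to unit L2 norm."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    if not tokens:
        return total
    for tok in tokens:
        vec = emb.get(tok)
        if vec is None:  # assumption: out-of-vocabulary words are skipped
            continue
        for k, x in enumerate(vec):
            total[k] += x
    mean = [x / len(tokens) for x in total]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


# Toy embedding (illustrative values only).
emb = {"meal": [3.0, 0.0], "plan": [0.0, 4.0]}
vec = user_vector(["meal", "plan"], emb)
```

Because of the final L2 normalization, the division by the word count changes the scale of the intermediate vector but not the direction of the final one.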

Given the small number of anorexia cases in the training set, we used SMOTE [4]
as an over-sampling method to deal with the unbalanced classes. The
Scikit-learn implementations of Logistic Regression (LR), Random Forest (RF),
Multilayer Perceptron (MLP), and Support Vector Machines (SVM) were tested as
classifiers over the training set with a 5-fold cross-validation approach. A
grid search was done for each method to find the best parameters for the
models.


Results. The results of the baselines are compared to models with our
variations. For Variation 4 and Baselines 4 and 5 we use the 901 predictive
terms of Sect. 4.4. To define the parameters of Variation 3, we test different
configurations, as in Sect. 4.3, and choose the ones with the best results
according to PS.

Precision (P), Recall (R), F1-Score (F1) and Accuracy (A) are used as
evaluation measures. The scores for P, R and F1 reported over the test set in
Table 4 correspond to the Anorexia (positive) class, as this is the most
relevant one, whereas A corresponds to the accuracy computed on both classes.
Seeing that there are 6 times more control cases than AN cases and that false
negative (FN) cases are a bigger concern than false positives (FP), we
prioritize R and F1 over P and A. As with most medical screening tasks,
classifying a user at risk as a control case (FN) is worse than the opposite
(FP), in particular for a classifier that is intended to be a first filter to
detect users at risk and eventually alert clinicians, who are the ones that do
a specialized screening of the user profile. Table 4 shows the results for the
best classifiers. The best scores are highlighted for each measure.
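
The relation between the four measures can be illustrated with a short Python sketch. The confusion counts used below are hypothetical: they were inferred for illustration from the test set sizes in Table 1 (41 anorexia and 279 control users) and are not reported in the paper, although they happen to reproduce the values listed for Variation 2 in Table 4.

```python
def screening_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical counts (inferred, not reported): 33 + 8 = 41 anorexia users,
# 11 + 268 = 279 control users.
p, r, f1, a = screening_metrics(tp=33, fp=11, fn=8, tn=268)
print(f"P={p:.2%} R={r:.2%} F1={f1:.2%} A={a:.2%}")
```

With these counts, 8 users at risk would be missed (FN) and 11 control users would be flagged (FP). The accuracy stays high mainly because the control class dominates the test set, which is one reason to prioritize R and F1.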

Comparing the baselines, we can notice that the embedding-based approaches
provide an improvement in R compared to the BoW model; however, this comes with
a significant loss in P.

Regarding the embedding-based models, our variations outperform the results
obtained by the baselines. The model with the embeddings generated with our
method (Variation 0) provides significantly better results than the Word2vec
model (Baseline 3), and even than the model with pre-learned embeddings
(Baseline 2), which has a wider vocabulary.

The combination of pre-learned embeddings and embeddings learned on our
training set provides the best results in terms of F1 and R. It also provides
a good accuracy considering that most of the test cases are controls. We can
also observe that using the weights of pre-learned embeddings (Variation 3) to
start our learning process over our corpus significantly improves the R score
in comparison to Word2vec's generated embeddings (Baseline 3).

The worst results for our variations are given by Variation 1, which obtains
results equivalent to Baseline 2. The best model in terms of F1 corresponds to
Variation 2. Also, better results are obtained for P when the embeddings are
enhanced by the retrofitting approach (Variation 4).


Table 4. Baselines and enhanced embeddings evaluated in terms of Precision (P),
Recall (R), F1-Score (F1) and Accuracy (A).


Model       | Description                                                                            | P      | R      | F1     | A      | Classifier
Baseline 1  | BoW model                                                                              | 90.00% | 65.85% | 76.06% | 94.69% | MLP
Baseline 2  | GloVe's pre-learned embeddings                                                         | 69.57% | 78.05% | 73.56% | 92.81% | MLP
Baseline 3  | Word2vec embeddings                                                                    | 70.73% | 70.73% | 70.73% | 92.50% | SVM
Baseline 4  | Word2vec retrofitted embeddings                                                        | 71.79% | 68.29% | 70.00% | 92.50% | SVM
Baseline 5  | GloVe's pre-learned embeddings retrofitted                                             | 67.35% | 80.49% | 73.33% | 92.50% | MLP
Variation 0 | Predictive pairs embeddings (β_P = 50, β_N = 50)                                       | 77.50% | 75.61% | 76.54% | 94.03% | MLP
Variation 1 | Predictive pairs embeddings + GloVe embeddings                                         | 69.57% | 78.05% | 73.56% | 92.81% | MLP
Variation 2 | Predictive pairs embeddings (β_P = 50, β_N = 50) + GloVe embeddings                    | 75.00% | 80.49% | 77.65% | 94.06% | MLP
Variation 3 | Predictive pairs embeddings + GloVe embeddings starting weights (β_P = 0.25, β_N = 50) | 72.73% | 78.05% | 75.29% | 93.44% | MLP
Variation 4 | Predictive pairs (β_P = 50, β_N = 50) retrofitted embeddings                           | 82.86% | 70.73% | 76.32% | 94.37% | SVM


5 Conclusions and Future Work 


We presented an approach for enhancing word embeddings towards a classification
task on the detection of AN. Our method extends Word2vec by considering
positive and negative costs in the objective function of a target term. The
costs are added by defining predictive terms for each of the target classes.
The combination of the generated embeddings with pre-learned embeddings is also
evaluated. Our results show that the usage of our enhanced embeddings
outperforms the results obtained by pre-learned embeddings and by embeddings
learned through Word2vec, despite the small size of the corpus. These results
are promising, as they might lead to new research paths to explore.

Future work involves the evaluation of the method on similar tasks that can be
formalized as document categorization problems addressing small corpora. Also,
ablation studies will be performed to assess the impact of each component on
the results obtained.
Abstract. In this paper, we consider the problem of event recognition in
single images. In contrast to the conventional fine-tuning of convolutional
neural networks (CNNs), we propose to use image captioning, i.e., a generative
model that converts images to textual descriptions. The motivation here is the
possibility to combine conventional CNNs with a completely different approach
in an ensemble with high diversity. As the event recognition task has nothing
serial or temporal, the obtained captions are one-hot encoded and summarized
into a sparse feature vector suitable for the learning of an arbitrary
classifier. We provide an experimental study of several feature extractors for
the Photo Event Collection, the Web Image Dataset for Event Recognition and the
Multi-Label Curation of Flickr Events Dataset. It is shown that the image
captions produced by a model trained on the Conceptual Captions dataset can be
classified more accurately than the features from an object detector, though
both are not as rich as the CNN-based features. However, an ensemble of a CNN
and our approach provides state-of-the-art results for several event datasets.


Keywords: Image captioning · Event recognition · Ensemble of classifiers ·
Convolutional neural network (CNN)


1 Introduction 


Nowadays, social networks and mobile devices create a vast stream of multimedia 
data because people are taking more photos in recent years than ever before [1]. 
To organize a large gallery of personal photos, they may be assigned to albums 
according to some events. Social events are happenings that are attended and 
shared by the people [2,3] and take place in a specific environment [4], e.g., 
holidays, sports events, weddings, various activities, etc. The album labels are 
usually assigned either manually or by using locations from EXIF data if the 
GPS tags in a camera are switched on. However, content-based image analysis 
has been recently introduced in photo organizing systems. Such analysis can be 
used to selectively look for photos for a particular event in order to keep nice
memories of some episodes of our lives [4] or to gather our specific interests for 
personalized recommender systems. 

There exist two different event recognition tasks [2]. In the first task, the event 
categories are recognized for the whole album (a sequence of photos). However, 
the assignments of images of the same event into albums may be unknown in 
practice. Hence, in this paper, we focus on the second task, namely, event recogni- 
tion in single images from social media. As an event here is a complex scene with 
large variations in visual appearance [4], deep learning techniques [5] are widely 
used. It is typical to fine-tune existing convolutional neural networks (CNNs) 
on event datasets [4]. Sometimes CNN-based object detection is applied [6] for 
discovering particular categories, e.g., interior objects, food, transport, sports 
equipment, animals, etc. [7,8]. 

However, in this paper, a slightly different approach is considered. In
contrast to the conventional usage of a CNN as a discriminative model in a
classifier design [9], we propose to borrow generative models to represent an
input image in another domain. In particular, we use existing methods of image
captioning [10] that generate textual descriptions of images. Our main
contribution is a demonstration that the generated descriptions can be fed to
the input of a classifier in an ensemble in order to improve the event
recognition accuracy of traditional methods. Though the proposed visual
representation is not as rich as the features extracted by fine-tuned CNNs, it
is better than the outputs of object detectors [8]. As our approach is
completely different from traditional CNNs, it can be combined with them into
an ensemble that possesses high diversity and, as a consequence, high accuracy.

The rest of the paper is organized as follows. In Sect. 2, a survey of image
captioning models is given. In Sect. 3, we introduce the proposed pipeline for
event recognition based on generated captions. Experimental results for several
event datasets are presented in Sect. 4. Finally, concluding comments and
future work are discussed in Sect. 5.


2 Literature Survey 


Most existing methods of event recognition in single photos rely on CNN-based
architectures [2]. Four layers of a fine-tuned CNN were used to extract
features for an LDA (Linear Discriminant Analysis) classifier in the ChaLearn
LAP 2015 cultural event recognition challenge [11]. The iterative selection
method [4] identifies the most relevant subset of classes for transferring
representations from CNNs learned from the object (ImageNet) and scene
(Places2) datasets. The bounding boxes of detected objects are projected onto
multi-scale spatial maps in [6]. An ensemble of scene classifiers and object
detectors provided high accuracy [12] on the Photo Event Collection (PEC) [13].
Unfortunately, there is a significant gap between the accuracies of event
classification in still photos [4] and in albums [14], so there is a strong
demand for more accurate methods of single-image processing.

That is why in this paper we propose to concentrate on other suitable visual
features extracted with generative models and, in particular, image captioning
techniques. There is a wide range of applications of image captioning, from the
automatic generation of descriptions for photos posted in social networks to
image retrieval from databases using generated text descriptions [15]. Image
captioning methods are usually based on an encoder-decoder neural network,
which first encodes an image into a fixed-length vector representation using a
pre-trained CNN and then decodes the representation into a caption (a natural
language description). During the training of the decoder (generator), the
input image and its ground-truth textual description are fed as inputs to the
neural network, while the one-hot encoded description represents the desired
network output. The description is encoded using text embeddings in the
Embedding (look-up) layer [5]. The generated image and text embeddings are
merged using concatenation or summation and form the input to the decoder part
of the network. It is typical to include a recurrent neural network (RNN) layer
followed by a fully connected layer with a Softmax output layer.

One of the first successful models, “Show and Tell” [16], won the first MS
COCO Image Captioning Challenge in 2015. It uses an RNN with long short-term
memory (LSTM) units in the decoder part. Its enhancement, “Show, Attend and
Tell” [17], incorporates a soft attention mechanism to improve the quality of
the caption generation. The “Neural Baby Talk” image captioning model [18] is
based on generating a template with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified by
object detectors. The foreground regions are obtained using the Faster R-CNN
network [19], and an LSTM with an attention mechanism serves as the decoder.
The “Multimodal Recurrent Neural Network” (mRNN) [20] is based on the Inception
network for image feature extraction and a deep RNN for sentence generation.
One of the best models nowadays is the Auto-Reconstructor Network (ARNet) [21],
which uses the Inception-V4 network [22] in the encoder, while the decoder is
based on an LSTM. There exist two pre-trained models, with greedy search
(ARNet-g) and beam search (ARNet-b) with size 3, to generate the final caption
for each input image.


3 Proposed Approach 


Our task can be formulated as a typical image recognition problem [9]. It is
required to assign an input photo X from a gallery to one of C > 1 event
categories (classes). The training set of N > 1 images 𝐗 = {X_n | n ∈ {1, ..., N}}
with known event labels c_n ∈ {1, ..., C} is available for classifier learning.
Sometimes the training photos of the same event are associated with an album
[13,14]. In such a case, the training albums are unfolded into a set 𝐗 so that
the collection-level label of the album is assigned to each photo from this
album. This task possesses several characteristics that make it extremely
challenging compared to album-based event recognition. One of these
characteristics is the presence of irrelevant images or unimportant photos that
can be associated with any event [2]. These images can be detected by
attention-based models when the whole album is available [1], but may have a
significant negative impact on the quality of event recognition in single
images.

As N is usually rather small, transfer learning may be applied [5]. A deep CNN
is first pre-trained on a large dataset, e.g., ImageNet or Places [23]. Second,
this CNN is fine-tuned on 𝐗, i.e., the last layer is replaced with a new layer
with Softmax activations and C outputs. An input image X is classified by
feeding it to the fine-tuned CNN to compute C scores from the output layer,
i.e., the estimates of posterior probabilities for all event categories. This
procedure can be modified by the extraction of deep image features (embeddings)
using the outputs of one of the last layers of the pre-trained CNN [5,24]. The
input image X and each training image X_n, n ∈ {1, ..., N}, are fed to the
input of the CNN, and the outputs of the last-but-one layer are used as the
D-dimensional feature vectors x = [x_1, ..., x_D] and x_n = [x_{n;1}, ...,
x_{n;D}], respectively. Such deep learning-based feature extractors allow the
training of a general classifier C_emb, e.g., k-nearest neighbor, random forest
(RF), support vector machine (SVM) or gradient boosting [9,25]. The
C-dimensional vector of confidence scores p_emb = C_emb(x) is predicted given
the input image in both cases: fine-tuning, with the last Softmax layer in the
role of the classifier C_emb, and feature extraction with a general classifier.
The final decision is made in favor of the class with maximal confidence.

In this paper, we use another approach to event recognition based on
generative models and image captioning. The proposed pipeline is presented in
Fig. 1. At first, the conventional extraction of embeddings x is implemented
using a pre-trained CNN. Next, these visual features and a vocabulary V are fed
to a special RNN-based neural network (generator) that produces the caption,
which describes the input image. The caption is represented as a sequence of
L > 0 tokens t = {t_0, t_1, ..., t_{L+1}} from the vocabulary (t_l ∈ V,
l ∈ {0, ..., L}). It is generated sequentially, word by word, starting from the
t_0 = <START> token until a special t_{L+1} = <END> word is produced [21].


[Figure: input image; feature extraction (CNN); caption generation (RNN) with
vocabulary; caption preprocessing; training set; classification of embeddings;
text classification; ensemble; predicted event category]


Fig. 1. Proposed event recognition pipeline based on image captioning

The generated caption $t$ is fed into an event classifier. In order to learn its
parameters, each n-th image from the training set is fed to the same image
captioning network to produce the caption $t_n = \{t_{n;0}, t_{n;1}, \ldots, t_{n;L_n+1}\}$.
Since the number of tokens $L_n$ is not the same for all images, it is necessary to
either train a sequential RNN-based classifier or transform all captions into
feature vectors with the same dimensionality. As the number of training instances
$N$ is not very large, we experimentally noticed that the latter approach is as
accurate as the former, though the training time is significantly lower. This fact
can be explained by the absence of anything temporal or serial in the initial task
of event recognition in single images. Hence, we decided to use one-hot encoding
and convert the sequences $t$ and $\{t_n\}$ into vectors of 0s and 1s as described
in [26]. In particular, we select a subset of the vocabulary $\tilde{V} \subset V$ by
choosing the most frequently occurring words in the training data $\{t_n\}$ with the
optional exclusion of stop words. Next, the input image is represented as the
$|\tilde{V}|$-dimensional sparse vector $\tilde{t} \in \{0,1\}^{|\tilde{V}|}$, where
$|\tilde{V}|$ is the size of the reduced vocabulary $\tilde{V}$ and the v-th component
of vector $\tilde{t}$ is equal to 1 only if at least one of the $L$ words in the caption
$t$ is equal to the v-th word from vocabulary $\tilde{V}$. This would mean, for instance,
turning the sequence {1, 5, 10, 2} into a $|\tilde{V}|$-dimensional sparse vector that
would be all 0s except for indices 1, 2, 5 and 10, which would be 1s [26]. The same
procedure is used to describe each n-th training image with a $|\tilde{V}|$-dimensional
sparse vector $\tilde{t}_n$. After that, an arbitrary classifier $C_{txt}$ of such textual
representations suitable for sparse data can be used to predict $C$ confidence
scores $p_{txt} = C_{txt}(\tilde{t})$. It is known [26] that such an approach is even more
accurate than conventional RNN-based classifiers (including one layer of LSTMs)
for the IMDB dataset.
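
The encoding described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names `build_vocabulary` and `encode_caption`, the whitespace tokenization, the toy captions, and the stop-word list are all assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(captions, size, stop_words=frozenset()):
    """Map each of the `size` most frequent training words to a vector index."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.split()      # assumed tokenization: split on whitespace
        if word not in stop_words        # optional exclusion of stop words
    )
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(size))}

def encode_caption(caption, vocabulary):
    """Binary vector whose v-th component is 1 if word v occurs in the caption."""
    vector = [0] * len(vocabulary)
    for word in caption.split():
        idx = vocabulary.get(word)
        if idx is not None:              # words outside the reduced vocabulary are dropped
            vector[idx] = 1
    return vector

# Toy training captions (hypothetical data for illustration only).
train_captions = [
    "a woman is doing a handstand at a local fair",
    "the tower of the city",
    "the statue of liberty and the moon",
]
vocab = build_vocabulary(
    train_captions, size=20, stop_words={"a", "is", "at", "of", "and"}
)
encoded = encode_caption("the tower and the moon", vocab)
```

Repeated words (here "the") set their component only once, so the vector records presence rather than counts, which matches the description of a vector of 0s and 1s.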

In general, we do not expect that classification of short textual descriptions is 
more accurate than the conventional image recognition methods. Nevertheless, 
we believe that the presence of image captions in an ensemble of classifiers can 
significantly improve its diversity [27]. Moreover, as the captions are generated 
based on the extracted feature vector x, only one inference in the CNN is required 
if we combine the conventional general classifier of embeddings from pre-trained 
CNN and the image captions. In this paper, the outputs of individual classifiers 
are combined in simple voting with soft aggregation. In particular, we compute 
aggregated confidences as the weighted sum of outputs of individual classifiers:


$$p_{ensemble} = [p_1, \ldots, p_C] = w \cdot p_{emb} + (1 - w) \cdot p_{txt}. \qquad (1)$$

The decision is taken in favor of the class with maximal confidence:

$$c^* = \underset{c \in \{1, \ldots, C\}}{\operatorname{argmax}} \; p_c. \qquad (2)$$

The weight $w \in [0,1]$ in (1) can be chosen using a special validation subset
in order to obtain the highest accuracy of criterion (2).
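
A minimal sketch of the fusion rule (1), the decision rule (2), and the selection of the weight on a validation subset is given below. The grid search over 11 values of the weight and the toy confidence vectors are assumptions for illustration; the text does not state how the weight was searched.

```python
def ensemble_scores(p_emb, p_txt, w):
    """Eq. (1): weighted sum of the two classifiers' confidence vectors."""
    return [w * e + (1.0 - w) * t for e, t in zip(p_emb, p_txt)]

def predict(p_emb, p_txt, w):
    """Eq. (2): index of the class with maximal aggregated confidence."""
    scores = ensemble_scores(p_emb, p_txt, w)
    return max(range(len(scores)), key=scores.__getitem__)

def choose_weight(val_emb, val_txt, val_labels, grid_size=11):
    """Return the first w on a uniform grid in [0, 1] with the best accuracy."""
    best_w, best_acc = 0.0, -1.0
    for step in range(grid_size):
        w = step / (grid_size - 1)
        correct = sum(
            predict(e, t, w) == y
            for e, t, y in zip(val_emb, val_txt, val_labels)
        )
        acc = correct / len(val_labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy validation data with C = 3 classes (made-up confidences).
val_emb = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1], [0.2, 0.3, 0.5]]
val_txt = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
val_labels = [0, 1, 2]
w, acc = choose_weight(val_emb, val_txt, val_labels)
```

In this toy case the text classifier alone (w = 0) misclassifies the first sample, and the search settles on the smallest grid value at which all three samples are classified correctly.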
Let us provide qualitative examples for the usage of our pipeline (Fig. 1). The
results of (correct) event recognition using our ensemble are presented in Fig. 2.
Here the first line of the title contains the generated image caption. In addition,
the title displays the result of event recognition using captions $t$ (second line),
embeddings $x_{emb}$ (third line), and the whole ensemble (last line). As one can
notice, the classification of captions alone is not always correct. However, our
ensemble is able to obtain a reliable solution even when individual classifiers
make wrong decisions.


[Figure: four sample images. Panel titles (generated caption, then the event
predicted from texts, embeddings, and the ensemble):]

"a woman is doing a handstand at a local fair": PersonalSports (texts),
ReligiousActivity (embeddings), PersonalArtActivity (ensemble)

"the tower of the city": ThemePark (texts), Architecture (embeddings),
ThemePark (ensemble)

"person , a painting by person": Museum (texts), UrbanTrip (embeddings),
PersonalArtActivity (ensemble)

"the statue of liberty and the moon": ThemePark (texts), Christmas (embeddings),
ThemePark (ensemble)


Fig. 2. Sample results of event recognition 
4 Experimental Results


In the experimental study, we examined the following event datasets: 


1. PEC [13] with 61,000 images from 807 collections of C = 14 social event 
classes (birthday, wedding, graduation, etc.). 

2. WIDER (Web Image Dataset for Event Recognition) [6] with 50,574 images 
and C = 61 events (parade, dancing, meeting, press conference, etc.). 

3. ML-CUFED (Multi-Label Curation of Flickr Events Dataset) [14] contains 
C = 23 common event types. Each album is associated with several events, 
i.e., it is a multi-label classification task. 


We used the standard train/test split proposed by the creators of each dataset.
In PEC and ML-CUFED, the collection-level label is directly assigned to each
image contained in this collection. Moreover, similarly to the paper [4], we
completely ignore any metadata, e.g., temporal information, and use only the
image itself. As a result, the training and validation sets are not ideally
balanced. The majority classes in each dataset contain 5 times more training
images than the minority classes. However, the class distribution in the training
and validation sets remains more or less identical, so that the number of
validation images for majority classes is also 5 times higher than the number of
testing examples for minority classes.

As we mainly focus on the possibility of implementing offline event recognition
on mobile devices [12], in order to compare the proposed approach with
conventional classifiers, we used MobileNet v2 with α = 1 [28] and Inception
v4 [22] CNNs. First, we pre-trained them on the Places2 dataset [23] for feature
extraction. The linear SVM classifier from the scikit-learn library was used
because it has higher accuracy than other classifiers from this library (RF, k-NN,
and RBF SVM) on the considered datasets. Moreover, we fine-tuned these CNNs
using the given training set as follows. First, the weights in the base part of
the CNN were frozen, and the new head (a fully connected layer with C outputs
and Softmax activation) was learned using the ADAM optimizer (learning rate
0.001) for 10 epochs with early stopping in the Keras 2.2 framework with the
TensorFlow 1.15 backend. Next, the weights in the whole CNN were learned for
5 epochs using ADAM. Finally, the CNN was trained using SGD for 3 epochs with
a 10 times lower learning rate.

In addition, we used features from object detection models that are typical
for event recognition [6,12]. As many photos from the same event sometimes
contain identical objects (e.g., a ball in football), they can be detected by
contemporary CNN-based methods, e.g., SSDLite [28] or Faster R-CNN [19].
These methods detect the positions of several objects in the input image and
predict the scores of each class from the predefined set of K > 1 types. We
extract the sparse K-dimensional vector of scores for each type of object. If
there are several objects of the same type, the maximal score is stored in this
feature vector [8]. This feature vector is either classified by the linear SVM or
used to train a feed-forward neural network with two hidden layers containing
32 units. Both classifiers were learned using the training set from each event
dataset. In this study, we examined SSD with the MobileNet backbone and
Faster R-CNN with the InceptionResNet backbone. The models pre-trained on
the Open Images Dataset v4 (K = 601 objects) were taken from the TensorFlow
Object Detection Model Zoo.
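
The construction of the K-dimensional detection feature vector can be illustrated as follows. This is a sketch under stated assumptions: the detector output is modeled as a hypothetical list of `(type_id, score)` pairs, and the function name `detection_features` is not taken from any library.

```python
def detection_features(detections, num_types):
    """Build a K-dimensional score vector from (type_id, score) detections."""
    features = [0.0] * num_types         # types that were not detected stay at 0
    for type_id, score in detections:
        # Several objects of the same type: keep only the maximal score.
        features[type_id] = max(features[type_id], score)
    return features

# Hypothetical detector output for one image with K = 5 object types.
detections = [(2, 0.91), (2, 0.40), (4, 0.75)]
features = detection_features(detections, num_types=5)
```

With K = 601 types and only a handful of detections per image, most components remain zero, which is why the vector is described as sparse.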

Our preliminary experimental study with the pre-trained image captioning
models discussed in Sect. 2 demonstrated that the best quality for the MS COCO
captioning dataset is achieved by the ARNet model [21]. Thus, in this experiment,
we used ARNet's encoder-decoder model. However, it can be replaced
with any other image captioning technique without modification of our event
recognition algorithm.

Unfortunately, event datasets do not contain captions (textual descriptions),
which are required to train or fine-tune the image captioning model. For
this reason, the image captioning model was trained on the Conceptual Captions
dataset. Today this dataset is the largest dataset used for image captioning.
It contains more than 3.3M image-URL and caption pairs in the training
set, and about 15 thousand pairs in the validation set. While there exist other
smaller datasets, such as MS COCO and Flickr, in our preliminary experiments,
the image captioning model trained on the Conceptual Captions
dataset provided better worst-case performance in the cross-dataset evaluation.

The feature extraction in the encoder is implemented not only with the same 
CNNs (Inception and MobileNet v2). We extracted the $|\tilde{V}| = 5000$ most frequent
words, excluding the special tokens <START> and <END>. The resulting caption vectors are classified by
either linear SVM or a feed-forward neural network with the same architecture 
as for the object detection case. Again, these classifiers are trained from scratch, 
given each event training set. The weight w in our ensemble (Eq. 1) was estimated 
using the same set. 

The results of the lightweight mobile (MobileNet and SSD object detector)
and deep models (Inception and Faster R-CNN) for PEC, WIDER and ML-CUFED
are presented in Tables 1, 2, and 3, respectively. Here we added the
best-known results for the same experimental setups.

Certainly, the proposed recognition of image captions is not as accurate as
conventional CNN-based features. However, the classification of textual
descriptions is much better than a random guess, whose accuracy is 100%/14 ≈ 7.14%,
100%/61 ≈ 1.64% and 100%/23 ≈ 4.35% for PEC, WIDER and ML-CUFED,
respectively. It is important to emphasize that our approach has a lower error
rate than the classification of the features based on object detection in most
cases. This gain is especially noticeable for lightweight SSD models, which are
1.5-13% less accurate than the proposed classification of image captions due to
the limited ability of SSD-based models to detect small objects (food, pets,
fashion accessories, etc.). The Faster R-CNN-based detection features can be
classified more accurately, but the inference in Faster R-CNN with the
InceptionResNet backbone is several times slower than the decoding in the image
captioning model (6-10 s vs. 0.5-2 s on a MacBook Pro 2015).
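
The random-guess baselines quoted above follow directly from the number of classes C in each dataset. The short check below recomputes them and sets them against the accuracies of the SVM on texts with lightweight models reported in Tables 1, 2 and 3; the dictionary layout is just an illustrative choice.

```python
# Number of event classes C per dataset, as stated in the dataset descriptions.
num_classes = {"PEC": 14, "WIDER": 61, "ML-CUFED": 23}

# Accuracy (%) of the SVM on texts with lightweight models (Tables 1, 2 and 3).
text_accuracy = {"PEC": 43.77, "WIDER": 26.38, "ML-CUFED": 37.24}

# Accuracy (%) of a uniform random guess over C classes.
chance = {name: round(100.0 / c, 2) for name, c in num_classes.items()}

# How many times the caption-based classifier exceeds the chance level.
gain = {name: text_accuracy[name] / chance[name] for name in num_classes}
```

Even for WIDER, where the caption-based accuracy is lowest in absolute terms, the classifier is more than an order of magnitude above chance.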
Table 1. Event recognition accuracy (%), PEC

| Classifier     | Features                   | Lightweight models | Deep models |
|----------------|----------------------------|--------------------|-------------|
| SVM            | Embeddings                 | 59.72              | 61.82       |
|                | Objects                    | 42.18              | 47.83       |
|                | Texts                      | 43.77              | 47.24       |
|                | Proposed ensemble (1), (2) | 60.56              | 62.87       |
| Fine-tuned CNN | Embeddings                 | 62.33              | 63.56       |
|                | Objects                    | 40.17              | 47.42       |
|                | Texts                      | 43.52              | 46.89       |
|                | Proposed ensemble (1), (2) | 63.38              | 65.12       |

| Best-known results                          | Accuracy |
|---------------------------------------------|----------|
| Aggregated SVM [13]                         | 41.4     |
| Bag of Sub-events [13]                      | 51.4     |
| SHMM [18]                                   | 55.7     |
| Initialization-based transfer learning [4]  | 60.6     |
| Transfer learning of data and knowledge [4] | 62.2     |


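Table 2. Event recognition accuracy (%), WIDER

| Classifier     | Features                   | Lightweight models | Deep models |
|----------------|----------------------------|--------------------|-------------|
| SVM            | Embeddings                 | 48.31              | 50.48       |
|                | Objects                    | 19.91              | 28.66       |
|                | Texts                      | 26.38              | 31.89       |
|                | Proposed ensemble (1), (2) | 48.91              | 51.59       |
| Fine-tuned CNN | Embeddings                 | 49.11              | 50.97       |
|                | Objects                    | 12.91              | 21.27       |
|                | Texts                      | 25.93              | 30.91       |
|                | Proposed ensemble (1), (2) | 49.80              | 51.84       |

| Best-known results                          | Accuracy |
|---------------------------------------------|----------|
| Baseline CNN [6]                            | 39.7     |
| Deep channel fusion [6]                     | 42.4     |
| Initialization-based transfer learning [4]  | 50.8     |
| Transfer learning of data and knowledge [4] | 53.0     |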


Finally, the most appropriate way to use image captioning in event classification
is its fusion with conventional CNNs. In this case, we improved on the previous
state of the art for PEC (62.2% [4]) even with the lightweight models (63.38%)
when the fine-tuned CNNs are used in an ensemble. Our Inception-based model is
even better (accuracy 65.12%). We have still not reached the state-of-the-art
accuracy of 53% [4] for the WIDER dataset, though our best accuracy (51.84%)
is about 9 percentage points higher than the best result (42.4%) from the original
paper [6]. Our experimental setup for the ML-CUFED dataset is studied for the
Table 3. Event recognition accuracy (%), ML-CUFED

| Classifier     | Features                   | Lightweight models | Deep models |
|----------------|----------------------------|--------------------|-------------|
| SVM            | Embeddings                 | 53.54              | 57.27       |
|                | Objects                    | 34.21              | 40.94       |
|                | Texts                      | 37.24              | 41.52       |
|                | Proposed ensemble (1), (2) | 55.26              | 58.86       |
| Fine-tuned CNN | Embeddings                 | 56.01              | 57.12       |
|                | Objects                    | 32.05              | 40.12       |
|                | Texts                      | 36.74              | 41.35       |
|                | Proposed ensemble (1), (2) | 57.94              | 60.01       |


first time here because this dataset is developed mostly for album-based event 
recognition. We should highlight that our preliminary experiments in the lat- 
ter task with this dataset and simple averaging of MobileNet features extracted 
from all images from an album slightly improved the state-of-the-art accuracy for 
this dataset, though it is necessary to study more complex feature aggregation 
techniques [1]. 

In practice, it is preferable to use a pre-trained CNN as a feature extractor in
order to avoid an additional inference in the fine-tuned CNN when it differs from
the encoder in the image captioning model. Unfortunately, the accuracies of SVM
for pre-trained CNN features are 1.5-3% lower when compared to the fine-tuned
models for PEC and ML-CUFED. In this case, an additional inference may be
acceptable. However, the difference in error rates between pre-trained and
fine-tuned models for the WIDER dataset is not significant, so the pre-trained
CNNs are clearly worth using here.


5 Conclusion 


In this paper, we have proposed to apply generative models in the classical 
discriminative task [9]; namely, image captioning in event recognition in still 
images. We have presented the novel pipeline of visual preference prediction 
using image captioning with the classification of generated captions and retrieval 
of images based on their textual descriptions (Fig. 1). It has been experimentally 
demonstrated that our approach is more accurate than the widely-used image 
representations obtained by object detectors [6,8]. Moreover, our approach is 
much faster than Faster R-CNNs, which do not implement one-shot detection. 
What is especially useful for ensemble models [27], the generated captions provide
additional diversity to conventional CNN-based recognition.

The motivation behind the study of image captioning techniques in this 
paper is connected not only with generating compact informative descriptions 
of images, but also with the wide possibilities to ensure the privacy of user data 
if further processing at remote servers is necessary. Moreover, as the vocabulary 
of generated captions is restricted, such techniques are considered effective
anonymization methods. Since the textual descriptions can be easily perceived
and understood by the user (as opposed to a vector of numeric features), users
are more likely to trust the use of such methods.

Unfortunately, short conceptual textual descriptions are obviously not
enough to classify event categories with high accuracy, even for a human, due
to errors and lack of specificity (see the examples of generated captions in Fig. 2).
Another disadvantage of the proposed approach is the need to repeat inference
if a fine-tuned CNN is applied in an ensemble. Hence, the decision-making time
will be significantly increased, though the overall accuracy also becomes higher
in most cases (Tables 1 and 3).

In the future, it is necessary to make the classification of generated captions
more accurate. First, though our preliminary experiments with LSTMs did
not decrease the error rate of our simple approach with linear SVM and
one-hot encoded words, we strongly believe that a thorough study of the RNN-based
classifiers of generated textual descriptors is required. Second, the comparison of
image captioning models trained on the Conceptual Captions dataset is needed 
to choose the best model for caption generation. Here the impact on event recog- 
nition accuracy arising from erroneous captions being generated should be exam- 
ined. Third, additional research is needed to check if we can fine-tune a CNN 
on an event dataset and use it as an encoder for the caption generation without 
loss of quality. In this case, a more compact and fast solution can be achieved. 
Finally, the proposed pipeline should be extended for the album-based event 
recognition [2,13] with, e.g., attention models [12]. 
Abstract. Humans increasingly interact with artificial intelligence (AI)
systems. AI systems are optimized for objectives such as minimum com- 
putation or minimum error rate in recognizing and interpreting inputs 
from humans. In contrast, inputs created by humans are often treated as 
a given. We investigate how inputs of humans can be altered to reduce 
misinterpretation by the AI system and to improve efficiency of input 
generation for the human while altered inputs should remain as similar as 
possible to the original inputs. These objectives result in trade-offs that 
are analyzed for a deep learning system classifying handwritten digits. 
To create examples that serve as demonstrations for humans to improve, 
we develop a model based on a conditional convolutional autoencoder 
(CCAE). Our quantitative and qualitative evaluation shows that on many
occasions the generated proposals lead to lower error rates, require less
effort to create and differ only modestly from the original samples. 


1 Introduction 


Human-to-AI information flow is increasing rapidly in importance and extent
across multiple modalities. For example, voice-machine interaction is becoming
more and more popular with deep learning networks recognizing text from
speech. Similarly, the progress in image recognition has lowered error rates in
gesture and optical character recognition. Still, key technologies in AI such as
deep learning are not perfect. They might also err given ambiguous inputs created
by humans. Errors become more likely when humans are in a hurry, are
unaware of the AI's recognition mechanism, are sloppy, or lack skill.
Safety-critical application areas such as autonomous driving or medical applications,
where an AI might depend on inputs from humans in one way or another, are
becoming more and more prominent. Thus, mistakes in recognizing and processing
inputs should be avoided. Apart from avoiding errors, humans might
also have an incentive to provide inputs with less effort, e.g. “Why try to speak
clearly and loudly in the presence of noise, if mumbling works just as well? Why
do that extra stroke in writing a character, if detection works just as well
without it?” In this work, we do not focus on how to improve AI systems that
recognize and interpret human information. We aim at strategies for how humans
can convey information better to such a system by adjusting their behavior.
Identifying potential improvements becomes more difficult when deep learning
is involved. Improvements are often based on a deep understanding of the
mechanisms of the task at hand, i.e. how an AI system processes inputs. Deep learning
is said to follow a black-box behavior. Even worse, deep learning is well known
to reason very differently from humans: deep learning models might astonish
due to their high accuracy rates, but disappoint at the same time by failing
on simple examples that were just slightly modified, as is well documented by
so-called “adversarial examples”. As such, humans might depend even more on
being shown opportunities for generating better data that serves as input to an
AI. In this work, we formalize the aforementioned partially conflicting goals, such
as minimizing wrongly recognized human inputs and reducing effort for humans,
both in terms of the need to adjust their behavior and in terms of interacting
effortlessly. We focus on the classification problem of digits, where we aim to provide
suggestions to humans by altering their generated inputs as illustrated in Fig. 1.
We express the problem in terms of a multi-objective optimization problem, i.e.
as a linear weighted sum. As a model, we use a conditional convolutional
autoencoder. Our qualitative and quantitative evaluation highlights that the generated
samples are visually appealing, easy to interpret and also lead to a lower error
rate in recognition.


[Figure: the Human-to-AI coach proposed in this paper gives feedback to a human
(speech bubble: "Ok, I see. My 9 and 6 are bad. For 6 and 7 less strokes will ...")
under three objectives: (1) Max. Accuracy, (2) Min. Effort To Create,
(3) Min. Human Change.]


Fig. 1. “Human-to-AI” (H2AI) coach: From misunderstandings to understanding 


2 Challenges of Human-to-AI Communication 


We consider the problem of improving human-generated inputs to an AI, as
illustrated in Fig. 1. A human wants to convey information to an AI using some mode,
e.g. speech, writing, or gestures. The processing of the received signals by the AI
often involves two steps: (i) recognition, i.e. identifying and extracting relevant
information in the input signal, and (ii) interpretation, i.e. deriving actions by
utilizing the information in a specific context. For recognition, the information
has to be extracted from a physical (analog) signal, e.g. using speech recognition,
image recognition, etc. In case information is communicated in a digital manner
using structured data, recognition is commonly unnecessary. Often the extracted
information has to be further processed by the AI using some form of
sense-making or interpretation. The AI potentially requires semantic understanding
capabilities and might rely on the use of context such as prior discourse or
surroundings. We assume that the human interacts frequently with such a system,
so that it is reasonable for the human to improve on objectives such as errors
and efficiency in communication. In this paper, we consider the challenge of
discovering variations of the original inputs that might help a human to improve.
More formally, we consider a classification problem, where a user provides
data $D = (X, Y)$. Each sample $X$ should be recognized as class $Y$ by a classifier
$C_H$. We denote by $X_i$ the i-th feature of sample $X$. For illustration, for the case
of handwritten digits a sample $X$ is a gray-tone scan of a digit and $Y \in \{0, \ldots, 9\}$
is the digitized number. $X_i \in [0,1]$ gives the brightness of the i-th pixel in the scan.
The classification model $C_H$ was trained to optimize classification performance
on human samples, i.e. to maximize $P_{C_H}(Y|X)$. We regard the model $C_H$ as a given,
i.e. we do not alter it in any way, but use it in our optimization process. The
Human-to-AI coach “H2AI” takes as input one sample $X$ with its label $Y$. It
returns at least one proposal $\hat{X}$, i.e. $\hat{X} := \mathrm{H2AI}(X, Y)$. The suggestion $\hat{X}$ should
be superior to $X$ according to some objective, e.g. we might demand higher
certainty in recognition, $P_{C_H}(Y|X) < P_{C_H}(Y|\hat{X})$. In a handwriting scenario, a
human might use a proposal $\hat{X}$ based on an input $X$ to adjust her strokes.


3 Model and Objectives 


An essential requirement is that the modified samples are similar to the given 
input, otherwise a trivial solution is to always return “the perfect sample” that 
is the same for any input. This motivates utilizing an auto-encoder (Sect. 3.1) 
and adding multiple loss terms to handle various objectives (Sect. 3.2). 


3.1 Architecture 


Two approaches that allow creating (modified) samples are generative adversarial
networks (GANs) and autoencoders (AEs). There are also combinations
thereof, e.g. the pix2pix architecture [10] or the conditional variational autoencoder
[2]. Both [10] and [2] contain an AE whose decoder serves as a generator based
on a latent representation from the encoder and, additionally, a discriminator.
AEs tend to generate outcomes that are closer to the inputs, but these are often
smoother and less realistic looking. In our application, staying close to the input
is a key requirement, since we only want to show how a sample can be modified
rather than generating completely new samples. Thus, we decided to focus on an
AE-based architecture. We also investigate including a discriminator to improve
generated samples. More precisely, we utilize a conditional AE with extra loss
terms for regularization covering not only a discriminator loss but also losses for
efficiency and classification of modified samples, as shown in Fig. 3. Conditional
AEs are given as input the class of a sample in addition to the sample itself. This
often improves generated samples, in particular for samples that are ambiguous,
i.e. samples that seem to match multiple classes well.


[Figure: CCAE layer sequence. Layer labels in order: MaxPool-2,2, Conv-128,
MaxPool-2,2, Conv-256, MaxPool-2,2, Flatten, Dense-h+10, Conv-512, Conv-256,
NN-upsample, Conv-128, NN-upsample, Conv-64, Conv-1.]

Fig. 2. H2AI implementation using a convolutional conditional autoencoder (CCAE) 


Convolutional AEs are known to work well on image data. Therefore, we
propose a convolutional conditional AE (CCAE) as shown in Fig. 2, where the
NN-upsample layers in the decoder denote nearest-neighbor upsampling. After each
convolutional layer, there is a ReLU layer that is not shown in Fig. 2. Compared
to transposed convolutional layers, NN-upsampling with convolutional layers
prevents checkerboard artifacts in the resulting images.
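
To make the nearest-neighbor upsampling step concrete, here is a minimal pure-Python sketch operating on a 2D list of pixel intensities. The function name `nn_upsample` and the factor of 2 are assumptions for illustration; in the CCAE each such step is followed by a convolutional layer, which is not shown here.

```python
def nn_upsample(image, factor=2):
    """Nearest-neighbor upsampling: repeat every pixel `factor` times per axis."""
    upsampled = []
    for row in image:
        # Repeat each pixel horizontally.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # Repeat the widened row vertically, using independent copies.
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled

small = [[0.0, 1.0],
         [0.5, 0.2]]
large = nn_upsample(small)
```

Every output pixel is a copy of exactly one input pixel, so all output positions are covered uniformly before the following convolution is applied.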


[Figure: the conditional convolutional autoencoder (CCAE) with its autoencoder
loss and three additional regularizers: a classification loss from the
classification model the human interacts with, an efficiency loss, and a
discriminator loss (real vs. generated samples).]


Fig. 3. Human-to-AI (H2AI) model with its components and regularizers 
3.2 Objectives and Loss Terms


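The generated input samples should meet multiple criteria, each of which is
implemented as a loss term. The loss terms and their weighted sum (with
parameters $\alpha$) are given in Eq. 1 and illustrated in Fig. 3. The total loss $L_{Tot}(X,Y)$
contains four parameters $\alpha_{RE}$, $\alpha_{CL}$, $\alpha_{EF}$ and $\alpha_D$. It is possible to keep $\alpha_{RE}$ fixed and
use the other three to control the relative importance of the following objectives:

$$
\begin{aligned}
\hat{X} &:= \mathrm{CCAE}(X,Y) && \text{Sample proposed by H2AI coach} \\
L_{RE}(X,\hat{X}) &:= \sum_i |X_i - \hat{X}_i| && \text{Reconstruction or Change Loss} \\
L_{CL}(\hat{X},Y) & && \text{Classification Loss} \\
L_{EF}(\hat{X}) &:= \sum_i |\hat{X}_i| && \text{Efficiency Loss} \\
L_D(\hat{X}) &:= \log(1 - D(\hat{X})) && \text{Discriminator Loss} \\
L_{Tot}(X,Y) &:= \alpha_{RE} L_{RE}(X,\hat{X}) + \alpha_{CL} L_{CL}(\hat{X},Y) + \alpha_{EF} L_{EF}(\hat{X}) + \alpha_D L_D(\hat{X})
\end{aligned}
\qquad (1)
$$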


Minimal Effort to Change: Change might be difficult and tedious for humans.
Thus, the effort for humans to adjust their behavior should be minimized. This
implies that the original samples $X$ created by humans and the newly generated
variations $\hat{X}$ should be similar. This is covered by the reconstruction loss
$L_{RE}(X,\hat{X})$ of the AE (see Eq. (1)). It enforces the output and the input to be
similar. But parts of the input might be changed fairly drastically, i.e. for
handwritten digits pixels might change from 0 (black) to 1 (white) and vice versa.
For that reason, we do not employ an L2-metric, which heavily penalizes such
differences, but rather opt for an L1-metric.
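
The difference between the two metrics can be seen on a tiny example. The sketch below assumes the squared form of the L2 distance (the text does not specify whether the square root is taken) and uses made-up 8-pixel samples.

```python
def l1_distance(x, x_hat):
    """Sum of absolute pixel differences."""
    return sum(abs(a - b) for a, b in zip(x, x_hat))

def l2_squared_distance(x, x_hat):
    """Sum of squared pixel differences (assumed form of the L2-metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

original = [0.0] * 8
flipped_one = [1.0] + [0.0] * 7   # one pixel changes from 0 (black) to 1 (white)
shifted_all = [0.125] * 8         # every pixel changes slightly

l1_flip = l1_distance(original, flipped_one)            # 1.0
l1_shift = l1_distance(original, shifted_all)           # 1.0
l2_flip = l2_squared_distance(original, flipped_one)    # 1.0
l2_shift = l2_squared_distance(original, shifted_all)   # 0.125
```

Under L1 the single drastic flip costs the same as the diffuse change, whereas under squared L2 it costs eight times more, which is the heavy penalty on drastic pixel changes that the L1 choice avoids.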


Reduce Misunderstanding: The amount of information wrongly extracted or
interpreted by the AI should be reduced. AEs are known to have a denoising,
averaging effect. They are also known to improve performance in some cases
in conjunction with classification tasks [11]. To further foster a reduction in
misunderstandings, we minimize the classification loss $L_{CL}(\hat{X},Y)$ for generated
examples $\hat{X}$ for the model $C_H$ the human communicates with.


Realistic Samples: The generated samples $\hat{X}$ should still be comprehensible
for humans or other systems, i.e. look realistic. It can happen that a generated
proposal $\hat{X}$ is so optimized for the given AI model $C_H$ that it is not meaningful
in general. That is, the proposal $\hat{X}$ might appear not only very different
from prototypical examples of its class but very different from any example
occurring in reality. While AEs partially counteract this, AEs do not enforce
that samples look real, but tend to create smooth (averaged) samples. Thus, we
add a discriminator $D$, resulting in a GAN architecture, that should distinguish
between real and generated samples and make them look crisper. The added
discriminator loss $L_D(\hat{X})$ is $\log(1 - D(\hat{X}))$, where $\hat{X}$ is the generated sample
$\hat{X} := \mathrm{CCAE}(X,Y)$ for an input sample $X$ of a human of class $Y$.


Minimal Effort to Create Samples: Interaction should be effortless for the
human (and AI). To quantify the effort of a human to create a sample, time might
be a good option if available. If not, application-specific measures might be more
appropriate. For measuring effort in handwriting, the number (and length) of
strokes can be used. A good approximation can be the total amount of needed
“ink”, which corresponds to the L1-loss of the proposal $\hat{X}$, i.e. $L_{EF}(\hat{X}) :=
\sum_i |\hat{X}_i|$. We chose the L1 over the L2-metric, since having many low intensity
pixels (as fostered by L2) is generally discouraged.
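
The weighted sum in Eq. 1 can be sketched for a single sample as follows. Several parts are assumptions rather than details given in the text: the classification loss is taken to be the cross-entropy of the probability that the classifier assigns to the true class, the discriminator output `d_out` is taken to lie strictly between 0 and 1, and all numeric values except the reconstruction weight of 32 are made up.

```python
import math

def total_loss(x, x_hat, p_true_class, d_out, a_re, a_cl, a_ef, a_d):
    """Weighted sum of the four loss terms of Eq. 1 for one sample."""
    l_re = sum(abs(a - b) for a, b in zip(x, x_hat))  # reconstruction loss (L1)
    l_cl = -math.log(p_true_class)                    # classification loss (assumed cross-entropy)
    l_ef = sum(abs(b) for b in x_hat)                 # efficiency loss: total "ink"
    l_d = math.log(1.0 - d_out)                       # discriminator loss
    return a_re * l_re + a_cl * l_cl + a_ef * l_ef + a_d * l_d

x = [0.0, 1.0, 1.0, 0.0]        # flattened toy input sample
x_hat = [0.0, 1.0, 0.5, 0.0]    # toy proposal with one stroke pixel dimmed
loss = total_loss(x, x_hat, p_true_class=0.5, d_out=0.5,
                  a_re=32.0, a_cl=1.0, a_ef=1.0, a_d=1.0)
```

In this toy setting the reconstruction term contributes 32 x 0.5 = 16 and the efficiency term 1.5, while the classification and discriminator terms cancel (log 2 and -log 2), giving a total of 17.5.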


4 Evaluation 


We conducted both a qualitative and a quantitative evaluation on the MNIST
dataset, since it has been used by recent work in similar contexts [6,8]. It consists
of 50000 handwritten digits from 0 to 9 for training and 10000 digits for testing.
The classification model $C_H$, i.e. the system a user is supposed to communicate
well with, is a simple convolutional neural network (CNN) consisting of two
convolutional layers (8 and 16 channels) that are each followed by a ReLU and a
2 x 2 max-pooling layer. The last layer is a fully connected layer. The network
achieved a test accuracy of 95.97%. While this could be improved, it is not of
prime relevance for our problem, since the classifier $C_H$ is treated as a given. The
architecture of the H2AI coach is shown in Fig. 3 with details of the AE in Fig. 2
and loss terms in Eq. 1. We did not employ any data augmentation. We used
the Adam optimizer with learning rate 1e-4 for all models. Training lasted for
10 epochs with a batch size of 8. We trained 5 networks for each hyperparameter
setting. We performed statistical analysis of our results using t-tests.
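
As an illustration of comparing two hyperparameter settings with 5 trained networks each, the sketch below computes a two-sample t statistic. The text does not say which variant of the t-test was used; Welch's unequal-variance form is an assumption here, and the accuracies are made-up numbers, not results from the paper.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    var_a = variance(sample_a) / len(sample_a)   # squared standard error of mean a
    var_b = variance(sample_b) / len(sample_b)   # squared standard error of mean b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t, dof

# Made-up test accuracies of 5 networks per setting (hypothetical values).
baseline = [0.950, 0.952, 0.948, 0.951, 0.949]
with_loss = [0.960, 0.962, 0.958, 0.961, 0.959]
t_stat, dof = welch_t(with_loss, baseline)
```

The statistic would then be compared against the t distribution with `dof` degrees of freedom to obtain a p-value.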

For the ablation study we consider adding each of the losses in isolation to
the baseline with just the AE by varying the parameters $\alpha_{CL}$, $\alpha_{EF}$, $\alpha_D$ that control
their impact. For the AE we used $\alpha_{RE} = 32$ for all experiments.¹ Finally, we
consider a model where we add all losses. There are no fixed ranges for the
parameters $\alpha$, but they should be chosen so that all loss terms have a noticeable
impact on the total loss, at least in the early phases of training.²

Our qualitative analysis is a visual assessment of the generated images. We 
investigate images that were improved (in terms of each of the metrics), worsened 
and remained roughly the same. As quantitative measures we used the losses 
as defined in Eq. 1 except for classification, where we used the more common 
accuracy metric. 


4.1 Qualitative Analysis 


Figure 4 shows unmodified samples (leftmost column) and samples generated with
various configurations of loss weights $\alpha$. We use R.x to denote “row x”. The AE
(2nd column, $\alpha_{RE} = 32$) on its own already has an overall positive impact, yielding
smoother images than the original ones. It tends to improve efficiency by
removing “exotic” strokes,

¹ $\alpha_{RE}$ is not needed (could be set to 1). But, in practice, it is easier to vary $\alpha_{RE}$ than
to change $\alpha_{CL}$, $\alpha_{EF}$, $\alpha_D$, since they behave non-linearly.

² We found that altering $\alpha$ during training requires much more tuning, but yields only
modest improvements.
e.g. for the 2 in R.6 and the 5 in the last row, and sometimes also helps in
improving readability (i.e. ease of classification), e.g. the 8 in R.1 and the 6
in the 2nd last row both become more readable. Other digits might seem more
readable but are actually worsened, e.g. the 6 in R.6 appears to become a 0 (it is
actually a 6) and the 7 in R.7 appears to become more of a 9. When optimizing
in addition for efficiency (3rd column), some parts of digits get deleted, which
is sometimes positive and sometimes negative. Some benefits of the AE seem
to get undone, e.g. the 6 in the 2nd last row now looks again more like the
original with missing parts. The same holds for the 8 in R.1, though for both
some improvement in shape remains. More interestingly, the digits in R.6 both
get changed to 0, which is incorrect. On the positive side, several digits become
more readable through subtle changes, e.g. removals of parts, like the 5 in the
last row, the 2 in the 2nd last row or the 3 in R.3. When using the AE and the
discriminator (4th column in Fig. 4), we can observe that the samples become
slightly more realistic, i.e. crisper. We can see clear improvements for the 7 in
R.7 and the 6 in R.9. Many digits remain the same. When using the AE and the
classification loss (last column), smoothness increases and digits appear blurry.
Readability worsens for a few digits, e.g. the left 4 in R.2 can now be easily
confused with a 9, and the 6 in R.9 is no better than the original and worse than
the one using a discriminator. Overall, the classification loss helps to improve many
other samples. Some only now become readable, e.g. the 5 and 3 in R.8. Also,
some digits become simpler, e.g. the 1 in R.1 and the 7s in R.3, R.4 and R.7.


[Figure 4: grid of digit images; columns: Row, Original, and generated samples for the loss-weight configurations described in the text.]

Fig. 4. Original and generated samples using a subset of all loss terms




When combining all losses (Fig. 5), it can be observed that for some parameters
$\alpha$ larger values are possible while still getting reasonable results,
since the objectives might counteract each other. For example, the discriminator
loss pushes pixels to become brighter, whereas the efficiency loss pushes them
to be darker. We noticed that the strong smoothing effect due to the
classification loss is essentially removed, mainly due to the discriminator loss
but also partially due to the efficiency loss. The benefits of the
classification loss, however, mainly remain and are also improved: the 4 in R.2
and the 6 in R.9 become more readable. There are also differences in quality
among the three configurations. Interestingly, the original images show somewhat
more contrast, in particular compared to the second column. A careful observer
will notice a few bright points in the upper part of both 4s in R.2. These seem
to be artifacts of the optimization. It is well known that training GANs might
lead to non-convergence or mode collapse. The former was observed for (too)
large discriminator loss weight $\alpha_D$. We also noticed mode collapse for
large values of $\alpha_{CL}$ (not shown) and bad outcomes for large values of
$\alpha_{EF}$ as shown in the last column. Degenerated examples still score high
in some of the metrics, but are very poor in others, e.g. in the last column
accuracy and efficiency loss are good, but reconstruction loss is large. Still,
overall, combining all losses leads to the best results.


[Figure 5: grid of digit images; columns: Row, Original, and generated samples for several configurations of the loss weights.]

Fig. 5. Original and generated samples using all loss terms


4.2 Quantitative Analysis


Table 1 shows the loss terms (with accuracy instead of classification loss) for all
loss configurations also shown in Fig. 4 for our ablation study, with the
reconstruction loss (AE only) as baseline. We first discuss accuracy. The AE on its
own leads to a small gain in accuracy compared to the baseline classifier Cy
of 95.97%. Not surprisingly, optimizing accuracy directly (using a classification
loss, i.e. $\alpha_{CL} > 0$) leads to the best results: even for a seemingly small
$\alpha_{CL}$, accuracy exceeds 99.9%. While it appears that differences in accuracy
between various values of $\alpha_{CL}$ are not significant, from a statistical
perspective (using a t-test) they are (p-value < .001). For any $\alpha_{CL}$, the
network tends to always fail to learn the same samples, leading to very low
variance in accuracy. The large accuracy values are no surprise, since also for
the test set, the network is fed the correct label and therefore could in
principle always return a “prototypical” class sample, ignoring all other
information. When varying the efficiency loss weight $\alpha_{EF}$, accuracy
decreases, but the decrease was only statistically significant for $\alpha_{EF} > 8$
(p-value < .001). Adding a discriminator also negatively impacts accuracy, with
$\alpha_D > 0.64$ showing statistically significant worse results (p-value < .01).


Table 1. Results varying one loss term weight $\alpha_{CL}$, $\alpha_{EF}$, $\alpha_D$


Loss               | $\alpha_{CL}$ | $\alpha_{EF}$ | $\alpha_D$ | Accuracy | $L_{RE}$ | $L_{EF}$
Baseline (AE only) | 0.0           | 0.0           | 0.0        | 0.9609   | 0.00018  | 0.00097
Classific. loss    | 0.03          | 0.0           | 0.0        | 0.9994   | 0.00027  | 0.00096
                   | 0.08          | 0.0           | 0.0        | 0.9997   | 0.00041  | 0.00096
                   | 0.1           | 0.0           | 0.0        | 0.9998   | 0.00042  | 0.00092
                   | 0.24          | 0.0           | 0.0        | 1.0      | 0.00062  | 0.00085
Efficiency loss    | 0.0           | 1.0           | 0.0        | 0.9587   | 0.00019  | 0.00095
                   | 0.0           | 4.0           | 0.0        | 0.9607   | 0.00018  | 0.00093
                   | 0.0           | 8.0           | 0.0        | 0.9578   | 0.00019  | 0.00091
                   | 0.0           | 16.0          | 0.0        | 0.9458   | 0.00023  | 0.00081
                   | 0.0           | 32.0          | 0.0        | 0.1135   | 0.00098  | <1e-5
Discrim. loss      | 0.0           | 0.0           | 0.03       | 0.9608   | 0.00019  | 0.00099
                   | 0.0           | 0.0           | 0.16       | 0.96     | 0.0002   | 0.00096
                   | 0.0           | 0.0           | 0.64       | 0.9318   | 0.00032  | 0.00096


The reconstruction loss $L_{RE}$ is most tightly correlated with the visual
quality of the outcomes. In particular, a large AE loss is likely to imply poor
visual outcomes, despite the fact that other metrics such as accuracy indicate
good results. This can be observed in Table 1 for $\alpha_{CL} = 0.24$. Generally, the
reconstruction loss worsens when optimizing for accuracy ($\alpha_{CL} > 0$) or adding a
discriminator ($\alpha_D > 0$). Differences to the baseline are significant (p-value < .01).
For adding an efficiency loss, differences are only significant for values $\alpha_{EF} > 8$
(p-value < .01).

The efficiency loss decreases when adding other losses. For the discriminator,
differences are not significant compared to the baseline, while for all other losses
they are for any value $\alpha_{EF}$ and $\alpha_{CL} > 0.1$ (p-value < .01).


5 Related Work 


There are numerous types of AE. Related to our applications are denoising 
AE that are typically used through intentional noise injection with the goal of 
weight regularization. In contrast, we assume that noise is part of the input data 
and its removal is thus not motivated by regularization. The idea to combine 
AEs and GANs for image generation has been explored previously, e.g. [2] uses
a conditional variational AE and applies it for image inpainting and attribute 
morphing. In this work, we consider a novel application of this architecture type. 
Our work is a form of image-to-image translation [10]. Typically, inputs and
outputs are fairly different, e.g. the input could be a colored segmentation of
an image not showing any details and the output could be a photo-like image with
many details. In contrast, in our scenario in- and outputs are fairly similar. For
image in-painting or completion [9,16] a network learns to fill in blank spaces
of an image. In contrast, we might both in-paint and erase. Image manipulation
based on user edits has been studied in [18]. They learn the natural image
manifold using a generative adversarial network and express manipulations as
a constrained optimization problem. They apply both spatial and channel, i.e.
color, flow regularization. Their primary goal is to obtain realistic-looking
images after manipulations. Thus, their problem and approach are fairly
different. Furthermore, in contrast to the mentioned prior works [2,9,10,16,18] our work can
be classified as unsupervised learning. That is, we do not know the final out- 
puts, i.e. the images that should be proposed to the human. Prior work trains 
by comparing their outcome to a target. In our case, we do not have pairs of 
human input (images) and improved input (images) in our training data. 

The field of human-AI interaction is fairly broad. The effect of various user
and system characteristics has been extensively studied [13]. There has been lit- 
tle work on how to improve communication and prevent misunderstandings. [12] 
discusses high level, non-technical strategies to deal with errors in communica- 
tion using speech that originate either from humans or from machines. [4] lists 
some errors that occur when interacting with a robot using natural language, 
such as grammatical, geometrical misunderstandings as well as ambiguities. [5] 
highlighted the impact of nonverbal communication on efficiency and robustness 
in communication. It is shown that nonverbal communication can reduce errors. 
Our work also relates to the field of personalized explanations [15]. It aims to 
explain to a user how she might improve interaction with an AI. Explainabil- 
ity in the context of machine learning is generally more focused on interpreting 
decisions and models (see [1,15] for recent surveys). Counterfactual explanations
also seek to identify some form of modification of the input. [6] explains
by answering “How to modify an input to get classification Y?” and “What
is minimally needed?”. The former focuses on misclassified examples with the
goal of changing them with minimal effort to the correct class. For the latter,
all objectives except efficiency are ignored and there is only the constraint of
maintaining classification confidence above a threshold. Thus, [6] discusses
special cases of our work. Technically, [6] generates a perturbation that is
added to the sample such that the perturbation is minimal while the prediction
(either as the correct class or as an alternative class) reaches a threshold
confidence. They use an ordinary AE as an optional element on the perturbation,
which only slightly alters results. In contrast, we use a CCAE on the inputs,
which is essential. We optimize for multiple linearly weighted objectives without
thresholds. [8] aims at explaining counterfactuals, i.e. showing how to change
a class to another by combining images of both classes. That is, given a query
image and a distractor image they generate a composite image that essentially
uses parts of each input. For instance, in the right part of Fig. 6 the “7” in the
second row serves as query image, the “2” in the middle as distractor and the
rightmost column shows the outcome. The implementation relies on a gating
mechanism to select image parts. Differences are also noticeable in the outcomes
as shown in Fig. 6. The highlighted differences appear noisy in [6] and are not
necessarily intuitive, e.g. for column CEM-PP for digit “3” a stroke on top is 
missing, but [6] finds a miniature “3” within the given digit. The generated 
images in [8] appear more natural, but do have artifacts, e.g. the “2” being a 
composition of a “7” and a “2” has a “dot” in the bottom originating from 
the “7”. In conclusion, while counterfactual explanations [6,8] are related to our 
work, the objectives differ, e.g. we include efficiency, as well as methodology and 
outcomes. While we also make recommendations to a user, there are only weak 
ties to recommender systems. Even for interpretable recommendation systems 
[7] users typically primarily seek to understand decisions but do not commonly 
aim to alter their behavior to obtain better recommendations. 


[Figure 6: left panel with columns Orig, Pred, CEM PP, CEM PN; right panel with columns Query image, Distractor image, Composite image.]

Fig. 6. Left digits are taken from [6]. Right digits stem from [8].


6 Discussion and Conclusions 


Input from human to AI is likely to gain further in importance. This paper 
investigated improving information flow from human to AI by proposing adjust- 
ments to human generated examples based on optimizing multiple objectives. 
Our evaluation highlights that such an automatic approach is indeed feasible for 
handwriting. While we believe that our approach is suitable for other domains
such as speech recognition, details of the network architecture, the definition of
loss terms and the loss weights likely need to be adjusted. Furthermore, our work
focused on generating altered input samples fulfilling specific metrics, but it
leaves many questions unanswered when applying it. For instance, it did not
investigate how these samples are best shown or explained to users, e.g. by
highlighting differences or, maybe, even in textual form. These points and more
advanced multi-objective optimization, i.e. exploring the set of (Pareto) optimal
solutions rather than manually adjusting parameters $\alpha$, are subject to future
work. Furthermore, one might include more objectives, e.g. generating proposals 
that require little energy to process by the AI [14] or taking into account behav- 
ioral norms expected by people as common for social robots [3, 17]. We hope that 
in the future human-to-AI coaches will help non-experts to better interact with 
AI systems.


Abstract. Due to the steadily increasing relevance of machine learning
for practical applications, many of which come with safety
requirements, the notion of uncertainty has received increasing attention
in machine learning research in the last couple of years. In particular,
the idea of distinguishing between two important types of uncertainty,
often referred to as aleatoric and epistemic, has recently been studied in
the setting of supervised learning. In this paper, we propose to quantify
these uncertainties, referring, respectively, to inherent randomness and a
lack of knowledge, with random forests. More specifically, we show how
two general approaches for measuring the learner’s aleatoric and epistemic
uncertainty in a prediction can be instantiated with decision trees
and random forests as learning algorithms in a classification setting. In
this regard, we also compare random forests with deep neural networks,
which have been used for a similar purpose.


Keywords: Machine learning - Uncertainty - Random forest 


1 Introduction 


The notion of uncertainty has received increasing attention in machine learn- 
ing research in the last couple of years, especially due to the steadily increas- 
ing relevance of machine learning for practical applications. In fact, a trustwor- 
thy representation of uncertainty should be considered as a key feature of any 
machine learning method, all the more in safety-critical application domains 
such as medicine [9,22] or socio-technical systems [19, 20]. 

In the general literature on uncertainty, a distinction is made between two 
inherently different sources of uncertainty, which are often referred to as aleatoric 
and epistemic [4]. Roughly speaking, aleatoric (aka statistical) uncertainty refers 
to the notion of randomness, that is, the variability in the outcome of an exper- 
iment which is due to inherently random effects. The prototypical example of 
aleatoric uncertainty is coin flipping. As opposed to this, epistemic (aka sys- 
tematic) uncertainty refers to uncertainty caused by a lack of knowledge, i.e., 
it relates to the epistemic state of an agent or decision maker. This uncertainty 
can in principle be reduced on the basis of additional information. In other
words, epistemic uncertainty refers to the reducible part of the (total)
uncertainty, whereas aleatoric uncertainty refers to the non-reducible part.

More recently, this distinction has also received attention in machine learn- 
ing, where the “agent” is a learning algorithm [18]. In particular, a distinction 
between aleatoric and epistemic uncertainty has been advocated in the literature 
on deep learning [6], where the limited awareness of neural networks of their own 
competence has been demonstrated quite nicely. For example, experiments on
image classification have shown that a trained model often fails on specific
instances, despite being very confident in its prediction. Moreover, such models
are often lacking robustness and can easily be fooled by “adversarial examples” 
[14]: Drastic changes of a prediction may already be provoked by minor, actually 
unimportant changes of an object. This problem has not only been observed for 
images but also for other types of data, such as natural language text [17]. 

In this paper, we advocate the use of decision trees and random forests, not 
only as a powerful machine learning method with state-of-the-art predictive
performance, but also for measuring and quantifying predictive uncertainty. More
specifically, we show how two general approaches for measuring the learner’s
aleatoric and epistemic uncertainty in a prediction (recalled in Sect. 2) can be
instantiated with decision trees and random forests as learning algorithms in a
classification setting (Sect. 3). In an experimental study on uncertainty-based
abstention (Sect. 4), we compare random forests with deep neural networks,
which have been used for a similar purpose.


2 Epistemic and Aleatoric Uncertainty 


We consider a standard setting of supervised learning, in which a learner is given
access to a set of (i.i.d.) training data
$\mathcal{D} := \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is an
instance space and $\mathcal{Y}$ the set of outcomes that can be associated with an instance.
In particular, we focus on the classification scenario, where $\mathcal{Y} = \{y_1, \ldots, y_K\}$
consists of a finite set of class labels, with binary classification ($\mathcal{Y} = \{0, 1\}$) as
an important special case.

Suppose a hypothesis space $\mathcal{H}$ to be given, where a hypothesis $h \in \mathcal{H}$ is a
mapping $\mathcal{X} \to \mathbb{P}(\mathcal{Y})$, i.e., a hypothesis maps instances $x \in \mathcal{X}$ to probability
distributions on outcomes. The goal of the learner is to induce a hypothesis
$h^* \in \mathcal{H}$ with low risk (expected loss)

$$R(h) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(h(x), y) \, dP(x, y), \tag{1}$$

where $P$ is the (unknown) data-generating process (a probability distribution
on $\mathcal{X} \times \mathcal{Y}$), and $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ a loss function. This choice of a hypothesis is
commonly guided by the empirical risk

$$R_{emp}(h) := \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i), y_i), \tag{2}$$

i.e., the performance of a hypothesis on the training data. However, since
$R_{emp}(h)$ is only an estimate of the true risk $R(h)$, the empirical risk
minimizer (or any other predictor)

$$\hat{h} := \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h) \tag{3}$$

favored by the learner will normally not coincide with the true risk minimizer
(Bayes predictor)

$$h^* := \operatorname{argmin}_{h \in \mathcal{H}} R(h). \tag{4}$$

Correspondingly, there remains uncertainty regarding $h^*$ as well as the
approximation quality of $\hat{h}$ (in the sense of its proximity to $h^*$) and its true
risk $R(\hat{h})$. Eventually, one is often interested in the predictive uncertainty,
i.e., the uncertainty related to the prediction $\hat{y}_q$ for a concrete query
instance $x_q \in \mathcal{X}$. In other words, given a partial observation $(x_q, \cdot)$, we are
wondering what can be said about the missing outcome, especially about the
uncertainty related to a prediction of that outcome. Indeed, estimating and
quantifying uncertainty in a transductive way, in the sense of tailoring it to
individual instances, is arguably important and practically more relevant than a
kind of average accuracy or confidence, which is often reported in machine learning.
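
To make the empirical risk (2) and its minimizer (3) concrete, the following minimal sketch selects the best hypothesis from a small finite hypothesis space. The toy data, the threshold classifiers and the 0/1 loss are assumptions made purely for illustration; in particular, the hypotheses here return point predictions rather than probability distributions.

```python
def empirical_risk(h, data, loss):
    """Average loss of hypothesis h on the training data, as in Eq. (2)."""
    return sum(loss(h(x), y) for x, y in data) / len(data)


def empirical_risk_minimizer(hypotheses, data, loss):
    """Hypothesis with the smallest empirical risk, as in Eq. (3)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, data, loss))


def zero_one_loss(y_hat, y):
    """0/1 loss: 1.0 for a wrong prediction, 0.0 for a correct one."""
    return float(y_hat != y)


def make_threshold_classifier(t):
    """Hypothetical hypothesis: predict class 1 iff x >= t."""
    def h(x):
        return int(x >= t)
    h.threshold = t  # stored only so the selected hypothesis can be inspected
    return h


# Toy training data and a finite hypothesis space (both made up for illustration).
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
hypotheses = [make_threshold_classifier(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
h_hat = empirical_risk_minimizer(hypotheses, data, zero_one_loss)
```

With only four training points, several thresholds between 0.4 and 0.6 would fit the data equally well, which is a small-scale picture of why the selected hypothesis need not coincide with the true risk minimizer (4).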


[Figure 1: schematic of the hypothesis space; image not recoverable.]

Fig. 1. Different types of uncertainties related to different types of discrepancies and
approximation errors: $f^*$ is the pointwise Bayes predictor, $h^*$ is the best predictor
within the hypothesis space, and $\hat{h}$ the predictor produced by the learning algorithm.


As the prediction $\hat{y}_q$ constitutes the end of a process that consists of different
learning and approximation steps, all errors and uncertainties related to these
steps may also contribute to the uncertainty about $\hat{y}_q$ (cf. Fig. 1):


- Since the dependency between $\mathcal{X}$ and $\mathcal{Y}$ is typically non-deterministic, the
  description of a new prediction problem in the form of an instance $x_q$ gives
  rise to a conditional probability distribution

  $$p(y \mid x_q) = \frac{p(x_q, y)}{p(x_q)} \tag{5}$$

  on $\mathcal{Y}$, but it does normally not identify a single outcome $y$ in a unique way.
  Thus, even given full information in the form of the measure $P$ (and its
  density $p$), uncertainty about the actual outcome $y$ remains. This uncertainty
  is of an aleatoric nature. In some cases, the distribution (5) itself (called the
  predictive posterior distribution in Bayesian inference) might be delivered
  as a prediction. Yet, when having to commit to a point estimate, the best
  prediction (in the sense of minimizing the expected loss) is prescribed by the
  pointwise Bayes predictor $f^*$, which is defined by

  $$f^*(x) := \operatorname{argmin}_{\hat{y} \in \mathcal{Y}} \int_{\mathcal{Y}} \ell(y, \hat{y}) \, dP(y \mid x) \tag{6}$$

  for each $x \in \mathcal{X}$.

- The Bayes predictor (4) does not necessarily coincide with the pointwise
  Bayes predictor (6). This discrepancy between $h^*$ and $f^*$ is connected to the
  uncertainty regarding the right type of model to be fit, and hence the choice
  of the hypothesis space $\mathcal{H}$. We refer to this uncertainty as model uncertainty.
  Thus, due to this uncertainty, one cannot guarantee that $h^*(x) = f^*(x)$, or,
  in case the hypothesis $h^*$ delivers probabilistic predictions $p(y \mid h^*, x)$ instead
  of point predictions, that $p(\cdot \mid h^*, x) = p(\cdot \mid x)$.

- The hypothesis $\hat{h}$ produced by the learning algorithm, for example the
  empirical risk minimizer (3), is only an estimate of $h^*$, and the quality of this
  estimate strongly depends on the quality and the amount of training data. We
  refer to the discrepancy between $\hat{h}$ and $h^*$, i.e., the uncertainty about how
  well the former approximates the latter, as approximation uncertainty.


As already said, aleatoric uncertainty is typically understood as uncertainty that 
is due to influences on the data-generating process that are inherently random, 
that is, due to the non-deterministic nature of the sought input/output depen- 
dency. This part of the uncertainty is irreducible, in the sense that the learner 
cannot get rid of it. Model uncertainty and approximation uncertainty, on the 
other hand, are subsumed under the notion of epistemic uncertainty, that is, 
uncertainty due to a lack of knowledge about the perfect predictor (6). Obvi- 
ously, this lack of knowledge will strongly depend on the underlying hypothesis 
space $\mathcal{H}$ as well as the amount of data seen so far: The larger the number $N = |\mathcal{D}|$
of observations, the less ignorant the learner will be when having to make a new
prediction. In the limit, when $N \to \infty$, a consistent learner will be able to
identify $h^*$. Moreover, the “larger” the hypothesis space $\mathcal{H}$, i.e., the weaker the prior
knowledge about the sought dependency, the higher the epistemic uncertainty
will be, and the more data will be needed to resolve this uncertainty.

How to capture these intuitive notions of aleatoric and epistemic uncertainty 
in terms of quantitative measures? In the following, we briefly recall two pro- 
posals that have recently been made in the literature. 


2.1 Entropy Measures 


An attempt at measuring and separating aleatoric and epistemic uncertainty on 
the basis of classical information-theoretic measures of entropy is made in [2].
This approach is developed in the context of neural networks for regression, but
the idea as such is more general and can also be applied to other settings. A
similar approach was recently adopted in [10].

Given a query instance $x$, the idea is to measure the total uncertainty in a
prediction in terms of the (Shannon) entropy of the predictive posterior
distribution, which, in the case of discrete $\mathcal{Y}$, is given as

$$H[p(y \mid x)] = \mathbb{E}_{p(y \mid x)}\{-\log_2 p(y \mid x)\} = -\sum_{y \in \mathcal{Y}} p(y \mid x) \log_2 p(y \mid x). \tag{7}$$


Moreover, the epistemic uncertainty is measured in terms of the mutual infor- 
mation between hypotheses and outcomes (i.e., the Kullback-Leibler divergence 
between the joint distribution of outcomes and hypotheses and the product of 
their marginals): 


$$I(y, h) = \mathbb{E}_{p(y, h)}\left\{\log_2\left(\frac{p(y, h)}{p(y)\, p(h)}\right)\right\}. \tag{8}$$


Finally, the aleatoric uncertainty is specified in terms of the difference between 
(7) and (8), which is given by 


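$$\mathbb{E}_{p(h \mid \mathcal{D})} H[p(y \mid h, x)] = -\int_{\mathcal{H}} p(h \mid \mathcal{D}) \left(\sum_{y \in \mathcal{Y}} p(y \mid h, x) \log_2 p(y \mid h, x)\right) dh. \tag{9}$$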


The idea underlying (9) is as follows: By fixing a hypothesis $h \in \mathcal{H}$, the
epistemic uncertainty is essentially removed. Thus, the entropy $H[p(y \mid h, x)]$, i.e.,
the entropy of the conditional distribution on $\mathcal{Y}$ predicted by $h$ for the query
instance $x$, is a natural measure of the aleatoric uncertainty. However, since $h$
is not precisely known, aleatoric uncertainty is measured in terms of the
expectation of this entropy with regard to the posterior probability $p(h \mid \mathcal{D})$.

The epistemic uncertainty (8) captures the dependency between the probability
distribution on $\mathcal{Y}$ and the hypothesis $h$. Roughly speaking, (8) is high
if the distribution $p(y \mid h, x)$ varies a lot for different hypotheses $h$ with high
probability. This is plausible, because the existence of different hypotheses, all
considered (more or less) probable but leading to quite different predictions, can
indeed be seen as a sign of high epistemic uncertainty.

Obviously, (8) and (9) cannot be computed efficiently, because they involve
an integration over the hypothesis space $\mathcal{H}$. One idea, therefore, is to
approximate these measures by means of ensemble techniques [10], that is, to
represent the posterior distribution $p(h \mid \mathcal{D})$ by a finite ensemble of hypotheses
$H = \{h_1, \ldots, h_M\}$. An approximation of (9) can then be obtained by

$$u_a(x) := -\frac{1}{M} \sum_{i=1}^{M} \sum_{y \in \mathcal{Y}} p(y \mid h_i, x) \log_2 p(y \mid h_i, x), \tag{10}$$

an approximation of (7) by

$$u_t(x) := -\sum_{y \in \mathcal{Y}} \left(\frac{1}{M} \sum_{i=1}^{M} p(y \mid h_i, x)\right) \log_2 \left(\frac{1}{M} \sum_{i=1}^{M} p(y \mid h_i, x)\right), \tag{11}$$

and finally an approximation of (8) by $u_e(x) := u_t(x) - u_a(x)$.
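
The ensemble approximations (10) and (11) are straightforward to compute once each ensemble member's class probabilities for a query instance are available. The following minimal sketch illustrates this with plain Python lists; the function names and the input layout (one probability vector per ensemble member) are assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(q * math.log2(q) for q in probs if q > 0)


def entropy_uncertainties(member_probs):
    """Return (total, aleatoric, epistemic) uncertainty for one query instance.

    member_probs[i][c] is p(y = c | h_i, x) for ensemble member h_i.
    """
    m = len(member_probs)
    k = len(member_probs[0])
    # Averaged prediction of the ensemble, used inside Eq. (11).
    mean_probs = [sum(p[c] for p in member_probs) / m for c in range(k)]
    total = entropy(mean_probs)                            # Eq. (11)
    aleatoric = sum(entropy(p) for p in member_probs) / m  # Eq. (10)
    epistemic = total - aleatoric                          # approximation of Eq. (8)
    return total, aleatoric, epistemic
```

Two extreme cases show how the decomposition behaves: if all members predict (0.5, 0.5), the uncertainty is purely aleatoric; if half of the members predict (1, 0) and the other half (0, 1), the total uncertainty is the same but it is purely epistemic.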


2.2 Measures Based on Relative Likelihood 


Another approach, put forward in [18], is based on the use of relative likelihoods, 
historically proposed by [1] and then justified in other settings such as possibility 
theory [21]. Here, we briefly recall this approach for the case of binary classifica- 
tion, i.e., where Y = {0,1}; see [13] for an extension to the case of multinomial 
classification. 

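Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$, the normalized likelihood of
$h \in \mathcal{H}$ is defined as

$$\pi_{\mathcal{H}}(h) := \frac{L(h)}{L(h^{ml})} = \frac{L(h)}{\max_{h' \in \mathcal{H}} L(h')}, \tag{12}$$

where $L(h) = \prod_{i=1}^{N} p(y_i \mid h, x_i)$ is the likelihood of $h$, and $h^{ml} \in \mathcal{H}$ the maximum
likelihood estimate. For a given instance $x$, the degrees of support (plausibility)
of the two classes are defined as follows:

$$\pi(1 \mid x) = \sup_{h \in \mathcal{H}} \min\left[\pi_{\mathcal{H}}(h),\; p(1 \mid h, x) - p(0 \mid h, x)\right], \tag{13}$$

$$\pi(0 \mid x) = \sup_{h \in \mathcal{H}} \min\left[\pi_{\mathcal{H}}(h),\; p(0 \mid h, x) - p(1 \mid h, x)\right]. \tag{14}$$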


So, $\pi(1 \mid x)$ is high if and only if a highly plausible hypothesis supports the
positive class much more strongly (in terms of the assigned probability) than the
negative class (and $\pi(0 \mid x)$ can be interpreted analogously). Given the above
degrees of support, the degrees of epistemic and aleatoric uncertainty are defined
as follows:


$$u_e(x) = \min\left[\pi(1 \mid x), \pi(0 \mid x)\right], \tag{15}$$

$$u_a(x) = 1 - \max\left[\pi(1 \mid x), \pi(0 \mid x)\right]. \tag{16}$$


Thus, epistemic uncertainty refers to the case where both the positive and the 
negative class appear to be plausible, while the degree of aleatoric uncertainty 
(16) is the degree to which none of the classes is supported. More specifically, 
the above measures have the following properties: 


- $u_e(x)$ will be high if class probabilities strongly vary within the set of
  plausible hypotheses, i.e., if we are unsure how to compare these probabilities. In
  particular, it will be 1 if and only if we have $h(x) = 1$ and $h'(x) = 0$ for two
  totally plausible hypotheses $h$ and $h'$;
- $u_a(x)$ will be high if class probabilities are similar for all plausible hypotheses,
  i.e., if there is strong evidence that $h(x) \approx 0.5$. In particular, it will be close to
  1 if all plausible hypotheses allocate their probability mass around $h(x) = 0.5$.


As can be seen, the measures (15) and (16) are actually quite similar in spirit 
to the measures (8) and (9). 


3 Random Forests 


Our basic idea is to instantiate the (generic) uncertainty measures presented in 
the previous section by means of decision trees [15,16], that is, with decision 
trees as an underlying hypothesis space H. This idea is motivated by the fact 
that, firstly, decision trees can naturally be seen as probabilistic predictors [7], 
and secondly, they can easily be used as an ensemble in the form of a random 
forest—recall that ensembling is needed for the (approximate) computation of 
the entropy-based measures in Sect. 2.1. 


3.1 Entropy Measures 


The approach in Sect. 2.1 can be realized with decision forests in a quite
straightforward way. Let $H = \{h_1, \ldots, h_M\}$ be a classifier ensemble in the form of a
random forest consisting of decision trees $h_i$. Moreover, recall that a decision
tree $h_i$ partitions the instance space $\mathcal{X}$ into (rectangular) regions $R_{i,1}, \ldots, R_{i,L_i}$
(i.e., $\bigcup_{l=1}^{L_i} R_{i,l} = \mathcal{X}$ and $R_{i,k} \cap R_{i,l} = \emptyset$ for $k \neq l$) associated with corresponding
leaves of the tree (each leaf node defines a region $R$). Given a query instance $x$,
the probabilistic prediction produced by the tree $h_i$ is specified by the
Laplace-corrected relative frequencies of the classes $y \in \mathcal{Y}$ in the region $R_{i,j} \ni x$:

$$p(y \mid h_i, x) = \frac{n_{i,j}(y) + 1}{n_{i,j} + |\mathcal{Y}|},$$

where $n_{i,j}$ is the number of training instances in the leaf node $R_{i,j}$, and $n_{i,j}(y)$
the number of instances with class $y$. With probabilities estimated in this way,
the uncertainty degrees (10) and (11) can directly be derived.
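
As a small illustration of the Laplace correction, the following sketch turns the class counts of a single leaf into a probability vector. The function name and the input format (one count per class) are assumptions made for illustration.

```python
def laplace_corrected_probs(class_counts):
    """Laplace-corrected class probabilities for one leaf.

    class_counts[c] is the number of training instances of class c in the leaf;
    the result is (count + 1) / (total + number of classes) for each class.
    """
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [(count + 1) / (total + num_classes) for count in class_counts]
```

For a leaf with three positive and no negative instances, the corrected estimate is (0.8, 0.2) rather than (1, 0), so small pure leaves do not produce overly confident predictions; an empty leaf yields the uniform distribution.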


3.2 Measures Based on Relative Likelihood 


Instantiating the approach in Sect. 2.2 essentially means computing the degrees 
of support (13-14), from which everything else can easily be derived. 

As already said, a decision tree partitions the instance space into several
regions, each of which can be associated with a constant predictor. More
specifically, in the case of binary classification, the predictor is of the form $h_\theta$,
$\theta \in \Theta = [0, 1]$, where $h_\theta(x) \equiv \theta$ is the (predicted) probability $p(1 \mid x \in R)$
of the positive class in the region. If we restrict inference to a local region, the
underlying hypothesis space is hence given by $\mathcal{H} = \{h_\theta \mid 0 \le \theta \le 1\}$.

With $p$ and $n$ the number of positive and negative instances, respectively,
within a region $R$, the likelihood and the maximum likelihood estimate of $\theta$ are
respectively given by

$$L(\theta) = \binom{n+p}{n} \theta^p (1 - \theta)^n \quad \text{and} \quad \theta^{ml} = \frac{p}{n+p}. \tag{17}$$

Therefore, the degrees of support for the positive and negative classes are

$$\pi(1 \mid x) = \sup_{\theta \in [0,1]} \min\left(\frac{\theta^p (1 - \theta)^n}{\left(\frac{p}{n+p}\right)^p \left(\frac{n}{n+p}\right)^n},\; 2\theta - 1\right), \tag{18}$$

$$\pi(0 \mid x) = \sup_{\theta \in [0,1]} \min\left(\frac{\theta^p (1 - \theta)^n}{\left(\frac{p}{n+p}\right)^p \left(\frac{n}{n+p}\right)^n},\; 1 - 2\theta\right). \tag{19}$$


Solving (18) and (19) comes down to maximizing a scalar function over a 
bounded domain, for which standard solvers can be used. From (18-19), the 
epistemic and aleatoric uncertainty associated with the region R can be derived 
according to (15) and (16), respectively. For different combinations of n and p, 
these uncertainty degrees can be pre-computed. 

Note that, for this approach, the uncertainty degrees (15) and (16) can be 
obtained for a single tree. To leverage the ensemble H, we average both uncer- 
tainties over all trees in the random forest. 
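
The computation for a single leaf can be sketched as follows. This is only an illustration under stated assumptions: instead of a standard solver, it maximizes (18) and (19) by a simple grid search over $\theta$, the function names and the grid size are made up, and the case of an empty leaf is handled by treating every $\theta$ as fully plausible. The binomial coefficient in (17) cancels in the normalized likelihood and is therefore omitted.

```python
import math


def _k_log(k, t):
    """k * log(t) with the convention 0 * log(0) = 0."""
    if k == 0:
        return 0.0
    return k * math.log(t) if t > 0 else -math.inf


def relative_likelihood_uncertainty(p, n, grid_size=10001):
    """Return (epistemic, aleatoric) uncertainty, Eqs. (15) and (16), for a leaf
    with p positive and n negative training instances."""
    theta_ml = p / (n + p) if n + p > 0 else 0.5
    log_lik_ml = _k_log(p, theta_ml) + _k_log(n, 1 - theta_ml)
    support_pos = 0.0  # approximates Eq. (18)
    support_neg = 0.0  # approximates Eq. (19)
    for i in range(grid_size):
        theta = i / (grid_size - 1)
        log_lik = _k_log(p, theta) + _k_log(n, 1 - theta)
        plausibility = min(1.0, math.exp(log_lik - log_lik_ml))
        support_pos = max(support_pos, min(plausibility, 2 * theta - 1))
        support_neg = max(support_neg, min(plausibility, 1 - 2 * theta))
    epistemic = min(support_pos, support_neg)
    aleatoric = 1 - max(support_pos, support_neg)
    return epistemic, aleatoric
```

Since the result depends only on the pair (p, n), the values can be cached in a lookup table, and the forest-level uncertainties are then obtained by averaging the per-tree values for the leaves into which the query instance falls. An empty leaf gives maximal epistemic uncertainty, whereas a large, perfectly balanced leaf gives high aleatoric and low epistemic uncertainty.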


4 Experiments 


The empirical evaluation of methods for quantifying uncertainty is a non-trivial 
problem. In fact, unlike for the prediction of a target variable, the data does 
normally not contain information about any sort of “ground truth” uncertainty. 
What is often done, therefore, is to evaluate predicted uncertainties indirectly, 
that is, by assessing their usefulness for improved prediction and decision mak- 
ing. Adopting an approach of that kind, we produced accuracy-rejection curves, 
which depict the accuracy of a predictor as a function of the percentage of
rejections [5]: A classifier, which is allowed to abstain on a certain percentage p of
predictions, will predict on those (100 − p)% on which it feels most certain. Being
able to quantify its own uncertainty well, it should improve its accuracy with 
increasing p, hence the accuracy-rejection curve should be monotone increasing 
(unlike a flat curve obtained for random abstention). 
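
An accuracy-rejection curve of this kind can be computed from per-instance correctness indicators and uncertainty scores. The sketch below is an assumption-laden illustration rather than the evaluation code used for the experiments: the function name, the use of rejection rates in [0, 1) instead of percentages, and the rounding rule for the number of rejected instances are all choices made here.

```python
def accuracy_rejection_curve(correct, uncertainty, rejection_rates):
    """Accuracy on the retained predictions for each rejection rate.

    correct[i] is 1 if prediction i is right and 0 otherwise, uncertainty[i]
    is its uncertainty score, and each rejection rate lies in [0, 1).
    """
    # Rank predictions from most certain to least certain.
    order = sorted(range(len(correct)), key=lambda i: uncertainty[i])
    total = len(order)
    curve = []
    for rate in rejection_rates:
        # Reject the most uncertain share, but always keep at least one prediction.
        num_kept = max(1, total - int(round(rate * total)))
        kept = order[:num_kept]
        curve.append(sum(correct[i] for i in kept) / num_kept)
    return curve
```

If the uncertainty scores are informative, the wrong predictions are rejected first and the curve rises; with uninformative scores it stays flat on average, which corresponds to the random-abstention baseline.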


4.1 Implementation Details 


For this work, we used the random forest classifier from scikit-learn. The number
of trees within the forest is set to 50, with the maximum tree depth set
to 10. We use bootstrapping to create diversity between the trees of the forest.

As a baseline to compare with, we used the DropConnect model for deep
neural networks as introduced in [10]. The idea of DropConnect is similar to
Dropout, but here, instead of randomly deleting neurons, we randomly delete the
connections between neurons. In this model, the dropping of connections
is also active in the test phase. In this way, the data passes through a different
network on each iteration, and therefore we can compute Monte Carlo samples
for each query instance. The DropConnect model is a feed-forward neural network
consisting of two DropConnect layers with 32 neurons and a final softmax layer
for the output. The model is trained for 20 epochs with a mini-batch size of 32.
After the training is done, we take 50 Monte Carlo samples to create an ensemble,
from which the uncertainty values can be calculated.


4.2 Results 


Due to space limitations, we show results in the form of accuracy-rejection curves
for only two exemplary data sets from the UCI repository¹, spect and diabetes;
very similar results were obtained for other data sets. The data is randomly
split into 70% for training and 30% for testing, and accuracy-rejection curves 
are computed on the latter (the curves shown are averages over 100 repetitions). 
In the following, we abbreviate the aleatoric and epistemic uncertainty degrees 
produced by the entropy-based approach (Sect. 2.1) and the approach based on 
relative likelihood (Sect. 2.2) by AU-ent, EU-ent, AU-rl, and EU-rl, respectively. 


Fig. 2. Accuracy-rejection curves for aleatoric (above) and epistemic (below) uncertainty
using random forests. The curve for random rejection is included as a baseline.


¹ https://archive.ics.uci.edu/ml/datasets/.

As can be seen from Figs. 1, 2, 3 and 4, both approaches to measuring uncertainty
are effective in the sense of producing monotone increasing accuracy-rejection
curves, and on the data sets we analyzed so far, we could not detect
any systematic differences in performance. Besides, rejection seems to work well
on the basis of both criteria, aleatoric as well as epistemic uncertainty. This is
plausible, since both provide sound reasons for a learner to abstain from
a prediction. Likewise, there are no big differences between random forests and
neural networks, showing that the former are indeed a viable alternative to the
latter, which was actually a major concern of our study.

Fig. 3. Scatter plot for test set on diabetes data, showing the relationship between the
uncertainty degrees (aleatoric left, epistemic right) estimated by the two approaches.

Fig. 4. Comparison between random forests and neural networks (DropConnect) for
aleatoric (above) and epistemic (below) uncertainty in the entropy-based approach.


5 Conclusion


The distinction between aleatoric and epistemic uncertainty has recently received 
a lot of attention in machine learning, especially in the deep learning community 
[6]. Roughly speaking, the approaches in deep learning are either based on the 
idea of equipping networks with a probabilistic component, like in Bayesian deep 
learning [11], or on using ensemble techniques [8], which can be implemented 
(indirectly) through techniques such as Dropout [3] or DropConnect. The main 
purpose of this paper was to show that the use of decision trees and random 
forests is an interesting alternative to neural networks. 

Indeed, as we have shown, the basic ideas underlying the estimation of 
aleatoric and epistemic uncertainty can be realized with random forests in a 
very natural way. In a sense, they even appear to be simpler and more flexible
than neural networks. For example, while the approach based on relative
likelihood (Sect. 2.2) could be realized efficiently for random forests, a neural
network implementation is far from obvious (and was therefore not included in
the experiments).

There are various directions for future work. For example, since the hyper- 
parameters of random forests have an influence on the hypothesis space we 
are (indirectly) working with, they also influence the estimation of uncertainty 
degrees. This relationship calls for a thorough investigation. Besides, going 
beyond a proof of principle with statistics such as accuracy-rejection curves, 
it would be interesting to make use of uncertainty quantification with random 
forests in applications such as active learning, as recently proposed in [12].


Abstract. Machine learning models deployed in real-world applications
are often evaluated with precision-based metrics such as F1-score or
AUC-PR (Area Under the Precision-Recall Curve). Heavily dependent
on the class prior, such metrics make it difficult to interpret the variation 
of a model’s performance over different subpopulations/subperiods in a 
dataset. In this paper, we propose a way to calibrate the metrics so that 
they can be made invariant to the prior. We conduct a large number of 
experiments on balanced and imbalanced data to assess the behavior of 
calibrated metrics and show that they improve interpretability and pro- 
vide a better control over what is really measured. We describe specific 
real-world use-cases where calibration is beneficial such as, for instance, 
model monitoring in production, reporting, or fairness evaluation. 


Keywords: Performance metrics · Class imbalance · Precision-recall


1 Introduction 


In real-world machine learning systems, the predictive performance of a model is 
often evaluated on multiple datasets, and comparisons are made. These datasets 
can correspond to sub-populations in the data, or different periods in time [15]. 
Choosing the best suited metrics is not a trivial task. Some metrics may prevent 
a proper interpretation of the performance differences between the sets [8, 14], 
especially because different datasets generally not only have a different likelihood
$P(x \mid y)$ but also a different class prior $P(y)$. A metric dependent on the prior (e.g.
precision) will be affected by both differences indiscernibly [3] but a practitioner
could be interested in isolating the variation of performance due to likelihood
which reflects the intrinsic model’s performance (see illustration in Fig. 1). Take
the example of comparing the performance of a model across time periods: At
time $t$, we receive data drawn from $P_t(x, y) = P_t(x \mid y) P_t(y)$ where $x$ are the
features and $y$ the label. Hence the optimal scoring function (i.e. model) for this
dataset is the likelihood ratio [11]:


$$s_t(x) := \frac{P_t(x \mid y = 1)}{P_t(x \mid y = 0)}. \tag{1}$$

In particular, if $P_t(x \mid y)$ does not vary with time, neither will $s_t(x)$. In this case,
even if the prior $P_t(y)$ varies, it is desirable to have a performance metric $M(\cdot)$
satisfying $M(s_t, P_t) = M(s_{t+1}, P_{t+1})$, $\forall t$, so that the model maintains the same
metric value over time. That being said, this does not mean that dependence on the
prior is an intrinsically bad behavior. Some applications seek this property as it
reflects a part of the difficulty of classifying a given dataset (e.g. the performance
of the random classifier evaluated with a prior-dependent metric is more or less
high depending on the skew of the dataset).

© The Author(s) 2020


[Figure: AUC-PR and fraud ratio $\pi$ plotted against time.]

Fig. 1. Evolution of the AUC-PR of a fraud detection system and of the fraud ratio ($\pi$,
i.e. the empirical $P_t(y)$) over time. Both decrease, but, as the AUC-PR depends
on the prior, it is not possible to tell whether the performance variation is only due to the
variation of $\pi$ or whether there was a drift in $P_t(x \mid y)$.


In binary classification, researchers often rely on the AUC-ROC (Area Under
the Curve of Receiver Operating Characteristic) to measure a classifier’s performance
[6,9]. While this metric has the advantage of being invariant to the class
prior, many real-world applications, especially when data are imbalanced, have
recently begun to favor precision-based metrics such as AUC-PR and F1-score
[12,13]. The reason is that AUC-ROC gives false positives too little
importance [5], although they strongly deteriorate user experience and
waste human effort with false alerts. Indeed, AUC-ROC considers a tradeoff
between TPR and FPR whereas AUC-PR/F1-score consider a tradeoff between
TPR (Recall) and Precision. On closer inspection, the difference boils down to the
fact that AUC-ROC normalizes the number of false positives with respect to the number
of true negatives, whereas precision-based metrics normalize it with respect to
the number of true positives. In highly imbalanced scenarios (e.g. fraud/disease
detection), true negatives are far more numerous than true positives because negative
examples are in the large majority.

Precision-based metrics give false positives more importance, but they are
tied to the class prior [2,3]. A redefinition of precision and recall as precision
gain and recall gain has recently been proposed to correct several drawbacks
of AUC-PR [7]. But, while the resulting AUC-PR Gain has some advantages
of the AUC-ROC such as the validity of linear interpolation between points,
it remains dependent on the class prior. Our study aims at providing metrics
(i) that are precision-based to tackle problems where the class of interest is highly
under-represented and (ii) that can be made independent of the prior for comparison
purposes (e.g. monitoring the evolution of the performance of a classifier
across several time periods). To reach this objective, this paper provides: (1) A
formulation of calibration for precision-based metrics. It computes the value of
precision as if the ratio $\pi$ of the test set was equal to a reference class ratio $\pi_0$.
We give theoretical arguments to explain why it allows invariance to the class
prior. We also provide a calibrated version of precision gain and recall gain [7].
(2) An empirical analysis on both synthetic and real-world data to confirm our
claims and show that the new metrics are still able to assess the model’s performance
and are easier to interpret. (3) A large-scale experiment on 614 datasets using
OpenML [16] to (a) give more insights on correlations between popular metrics
by analyzing how they rank models, (b) explore the links between the calibrated
metrics and the regular ones.

Not only does calibration solve the issue of dependence on the prior, but it also
allows, through the parameter $\pi_0$, anticipating a different ratio and controlling what
the metric precisely reflects. This new property has several practical benefits
(e.g. for development, reporting, analysis), and we discuss them in realistic use-cases
in Sect. 5.


2 Popular Metrics for Binary Classification: Advantages 
and Limits 


We consider a usual binary classification setting where a model has been trained
and its performance is evaluated on a test dataset of $N$ instances. $y_i \in \{0, 1\}$ is
the ground-truth label of the $i$-th instance and is equal to 1 (resp. 0) if the instance
belongs to the positive (resp. negative) class. The model provides $s_i \in \mathbb{R}$, a score
for the $i$-th instance to belong to the positive class. For a given threshold $\tau \in \mathbb{R}$,
the predicted label is $\hat{y}_i = 1$ if $s_i > \tau$ and 0 otherwise. Predictive performance
is generally measured using the number of true positives ($TP = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 1, y_i = 1)$),
true negatives ($TN = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 0, y_i = 0)$), false positives
($FP = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 1, y_i = 0)$), false negatives ($FN = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 0, y_i = 1)$). One can
compute relevant ratios such as the True Positive Rate (TPR) also referred to
as the Recall ($Rec = \frac{TP}{TP + FN}$), the False Positive Rate ($FPR = \frac{FP}{FP + TN}$) also
referred to as the Fall-out and the Precision ($Prec = \frac{TP}{TP + FP}$). As these ratios
are biased towards a specific type of error and can easily be manipulated with the
threshold, more complex metrics have been proposed. In this paper, we discuss
the most popular ones which have been widely adopted in binary classification:
F1-Score, AUC-ROC, AUC-PR and AUC-PR Gain. F1-Score is the harmonic
average between Prec and Rec:

$$F_1 = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec}. \tag{2}$$

The three other metrics consider every threshold $\tau$ from the highest $s_i$ to the
lowest. For each one, they compute TP, FP, TN and FN. Then, they plot one
ratio against another and compute the Area Under the Curve (Fig. 2). AUC-ROC
considers the Receiver Operating Characteristic curve where TPR is plotted
against FPR. AUC-PR considers the Precision vs Recall curve. Finally, in AUC-PR
Gain, the precision gain ($Prec_G$) is plotted against the recall gain ($Rec_G$).
They are defined in [7] as follows ($\pi = \frac{\sum_{i=1}^{N} y_i}{N}$ is the positive class ratio and we
always consider that it is the minority class in this paper):

$$Prec_G = \frac{Prec - \pi}{(1 - \pi)\, Prec} \tag{3}$$

$$Rec_G = \frac{Rec - \pi}{(1 - \pi)\, Rec} \tag{4}$$

[Figure: ROC curve, PR curve and PR Gain curve panels; top row $\pi = 0.003$, bottom row $\pi = 0.5$.]

Fig. 2. ROC, PR and PR gain curves for the same model evaluated on an extremely
imbalanced test set from a fraud detection application ($\pi = 0.003$, in the top row) and
on a balanced sample ($\pi = 0.5$, in the bottom row).


PR Gain enjoys many properties of the ROC that the regular PR analysis does
not (e.g. the validity of linear interpolations or the existence of universal baselines)
[7]. However, AUC-PR Gain becomes hardly usable in extremely imbalanced settings.
In particular, we can derive from (3) and (4) that $Prec_G$ and $Rec_G$ will be mostly
close to 1 if $\pi$ is close to 0 (see top right chart in Fig. 2).
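
A small numerical check illustrates this saturation. The helper name and the precision values are hypothetical, chosen only to show how (3) behaves for a small class ratio.

```python
def gain(value, pi):
    """Precision gain or recall gain of a precision/recall value, Eqs. (3) and (4)."""
    return (value - pi) / ((1.0 - pi) * value)

# The same two precision values, evaluated under a moderate and an extreme class ratio.
moderate = {prec: gain(prec, pi=0.05) for prec in (0.1, 0.5)}
extreme = {prec: gain(prec, pi=0.003) for prec in (0.1, 0.5)}
```

With the extreme ratio, both gains land in a narrow band just below 1 even though the underlying precisions differ by a factor of five, whereas the moderate ratio keeps them clearly apart.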


[Figure: instances ordered from high scores (left) to low scores (right) with a threshold line; Case (i) at ratio $\pi$ and Case (ii) at a lower ratio, annotated "Same recall, same false positive rate, lower precision". Legend: positive instance, negative instance.]

Fig. 3. Illustration of the impact of $\pi$ on precision, recall, and the false positive rate.
Instances are ordered from left to right according to their score given by the model.
The threshold is illustrated as a vertical line between the instances: those on the left
(resp. right) are classified as positive (resp. negative).

As explained in the introduction, precision-based metrics (F1, AUC-PR) are
better suited than AUC-ROC to problems with class imbalance. On the other
hand, only AUC-ROC is invariant to the positive class ratio. Indeed, FPR and
Rec are both unrelated to the class ratio because they only focus on one class,
but this is not the case for Prec. Its dependency on the positive class ratio $\pi$
is illustrated in Fig. 3: when comparing a case (i) with a given ratio $\pi$ and
another case (ii) where a randomly selected half of the positive examples has
been removed, one can visually understand that both recall and false positive
rate are the same but the precision is lower in the second case.
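
The comparison between the two cases can also be checked with a few lines of code. The confusion counts are hypothetical, and case (ii) uses the expected counts after removing half of the positives.

```python
def rates(tp, fn, fp, tn):
    """Return (recall, false positive rate, precision) from confusion counts."""
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return recall, fpr, precision

# Case (i): 50 positives and 950 negatives.
case_i = rates(tp=40, fn=10, fp=20, tn=930)
# Case (ii): half of the positives removed (expected counts), negatives unchanged.
case_ii = rates(tp=20, fn=5, fp=20, tn=930)
```

Recall and false positive rate are identical in both cases, while the precision drops in case (ii) because the same false positives are compared with fewer true positives.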


3 Calibrated Metrics 


We seek a metric that is based on Prec to tackle problems where data are 
imbalanced and the minority (positive) class is the one of interest but we want 
it to be invariant w.r.t. the class prior to be able to interpret its variation across 
different datasets (e.g. different time periods). To obtain such a metric, we will 
modify those based on Prec (AUC-PR, F1-Score and AUC-PR Gain) to make 
them independent of the positive class ratio $\pi$.


3.1 Calibration 


The idea is to fix a reference ratio $\pi_0$ and to weight the count of TP or FP in
order to calibrate them to the value that they would have if $\pi$ was equal to $\pi_0$.
$\pi_0$ can be chosen arbitrarily (e.g. 0.5 for balanced) but it is preferable to fix
it according to the task at hand (we analyze the impact of $\pi_0$ in Sect. 4 and
describe simple guidelines to fix it in Sect. 5).

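If the positive class ratio is $\pi_0$ instead of $\pi$, the ratio between negative examples
and positive examples is multiplied by $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}$. In this case, we expect the
ratio between false positives and true positives to be multiplied by $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}$.
Therefore, we define the calibrated precision $Prec_c$ as follows:

$$Prec_c = \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} FP} = \frac{1}{1 + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \frac{FP}{TP}} \tag{5}$$

Since $\frac{1-\pi}{\pi}$ is the imbalance ratio $\frac{N_-}{N_+}$ where $N_+$ (resp. $N_-$) is the number of
positive (resp. negative) examples, we have: $\frac{\pi}{1-\pi} \frac{FP}{TP} = \frac{FP/N_-}{TP/N_+} = \frac{FPR}{TPR}$ which is
independent of $\pi$.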
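
A minimal sketch of (5) follows. The function names and confusion counts are hypothetical; set B is set A with half of the positive examples removed, so only the class ratio differs between the two.

```python
def calibrated_precision(tp, fp, pi, pi0):
    """Calibrated precision of Eq. (5): precision as if the class ratio were pi0."""
    weight = (pi * (1.0 - pi0)) / (pi0 * (1.0 - pi))
    return tp / (tp + weight * fp)

def positive_ratio(tp, fn, fp, tn):
    """Empirical positive class ratio pi of a test set."""
    return (tp + fn) / (tp + fn + fp + tn)

# Hypothetical confusion counts with identical TPR and FPR but different priors.
set_a = dict(tp=40, fn=10, fp=20, tn=930)
set_b = dict(tp=20, fn=5, fp=20, tn=930)

results = {}
for name, counts in (("A", set_a), ("B", set_b)):
    pi = positive_ratio(**counts)
    regular = counts["tp"] / (counts["tp"] + counts["fp"])
    calibrated = calibrated_precision(counts["tp"], counts["fp"], pi, pi0=0.5)
    results[name] = (regular, calibrated)
```

The regular precision differs between the two sets, while the calibrated precision is the same for both, since it only depends on FPR/TPR and on the chosen reference ratio.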

Based on the calibrated precision, we can also define the calibrated F1-score,
the calibrated $Prec_G$ and the calibrated $Rec_G$ by replacing Prec by $Prec_c$ and
$\pi$ by $\pi_0$ in Eqs. (2), (3) and (4). Note that calibration does not change precision
gain. Indeed, calibrated precision gain $\frac{Prec_c - \pi_0}{(1-\pi_0)\, Prec_c}$ can be rewritten as $\frac{Prec - \pi}{(1-\pi)\, Prec}$
which is equal to the regular precision gain. Also, the interesting properties of
the recall gain were proved independently of the ratio $\pi$ in [7] which means that
calibration preserves them.
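
The rewriting step can be checked directly from (3) and (5). Writing $a = \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}$ (a shorthand introduced here only for readability) and using $\frac{1}{Prec_c} = 1 + a\,\frac{FP}{TP}$ and $\frac{1}{Prec} = 1 + \frac{FP}{TP}$, a short derivation is:

```latex
\begin{align*}
\frac{Prec_c - \pi_0}{(1-\pi_0)\,Prec_c}
  &= \frac{1}{1-\pi_0}\left(1 - \frac{\pi_0}{Prec_c}\right)
   = \frac{1}{1-\pi_0}\left(1 - \pi_0 - \pi_0\, a\, \frac{FP}{TP}\right) \\
  &= 1 - \frac{\pi_0\, a}{1-\pi_0}\,\frac{FP}{TP}
   = 1 - \frac{\pi}{1-\pi}\,\frac{FP}{TP}, \\
\frac{Prec - \pi}{(1-\pi)\,Prec}
  &= \frac{1}{1-\pi}\left(1 - \pi - \pi\,\frac{FP}{TP}\right)
   = 1 - \frac{\pi}{1-\pi}\,\frac{FP}{TP}.
\end{align*}
```

Both expressions reduce to the same quantity, which no longer involves $\pi_0$.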


3.2 Robustness to Variations in $\pi$


In order to evaluate the robustness of the new metrics to variations in $\pi$, we
create a synthetic dataset where the label is drawn from a Bernoulli distribution
with parameter $\pi$ and the feature is drawn from Normal distributions:

$$p(x \mid y = 1; \mu_1) = \mathcal{N}(x; \mu_1, 1), \quad p(x \mid y = 0; \mu_0) = \mathcal{N}(x; \mu_0, 1) \tag{6}$$


[Figure: performance measure plotted against $\pi$ (from 0.5 down to 0.0) for AUC-PR, AUC-P$_c$R, AUC-PR Gain, AUC-P$_c$R Gain, F1 score and F1 score$_c$.]

Fig. 4. Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated version
(AUC-P$_c$R, AUC-P$_c$R Gain, F1-score$_c$) as $\pi$ decreases. We arbitrarily set $\pi_0 = 0.5$ for
the calibrated metrics. The curves are obtained by averaging results over 30 runs and
we show the confidence intervals.


For several values of $\pi$, data points are generated from (6) with $\mu_1 = 2$ and
$\mu_0 = 1.8$. We consider a large number of points ($10^6$) so that the empirical class
ratio $\pi$ is approximately equal to the Bernoulli parameter $\pi$. We empirically
study the evolution of several metrics (F1-score, AUC-PR, AUC-PR Gain and
their calibrated versions) for the optimal model (as defined in (1)) as $\pi$ decreases
from $\pi = 0.5$ (balanced) to $\pi = 0.001$. We observe that the impact of the class
prior on the regular metrics is important (Fig. 4). It can be a serious issue for
applications where $\pi$ sometimes varies by one order of magnitude from one day
to another (see [4] for a real-world example), as it leads to a significant variation
of the measured performance (see the difference between AUC-PR when $\pi = 0.5$
and when $\pi = 0.05$) even if the optimal model remains the same. On the contrary,
the calibrated versions remain very robust to changes in the class prior $\pi$ even
for extreme values. Note that we here experiment with synthetic data to have
full control over the distribution/prior and make the analysis easier, but the
conclusions are exactly the same on real-world data.¹
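
The robustness can be illustrated at the population level without sampling. This is a simplified sketch, not the experiment above: it looks at precision for a single threshold instead of the area-based metrics. For the model in (6) with $\mu_1 > \mu_0$, the likelihood ratio (1) is increasing in $x$, so thresholding $x$ is equivalent to thresholding the optimal score. The threshold value and function names are illustrative choices.

```python
import math

def normal_sf(z):
    """Survival function 1 - Phi(z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def population_precisions(pi, pi0, tau, mu1=2.0, mu0=1.8):
    """Population precision and calibrated precision when predicting 1 for x > tau."""
    tpr = normal_sf(tau - mu1)  # P(x > tau | y = 1)
    fpr = normal_sf(tau - mu0)  # P(x > tau | y = 0)
    prec = pi * tpr / (pi * tpr + (1.0 - pi) * fpr)
    # TP is proportional to pi * tpr and FP to (1 - pi) * fpr, then Eq. (5) applies.
    weight = (pi * (1.0 - pi0)) / (pi0 * (1.0 - pi))
    prec_c = pi * tpr / (pi * tpr + weight * (1.0 - pi) * fpr)
    return prec, prec_c

table = {pi: population_precisions(pi, pi0=0.5, tau=1.9) for pi in (0.5, 0.05, 0.001)}
```

The regular precision shrinks as $\pi$ decreases, whereas the calibrated precision keeps the value obtained at $\pi = \pi_0 = 0.5$.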


¹ See appendix in https://figshare.com/articles/Calibrated_metrics_IDA_Supplementary_material_pdf/11848146.


3.3 Assessment of the Model Quality


Besides the robustness of the calibrated metrics to changes in $\pi$, we also want
them to be sensitive to the quality of the model. If the latter decreases regardless
of the value of $\pi$, we expect all metrics, calibrated ones included, to decrease in
value. Let us consider an experiment where we use the same synthetic dataset as
defined in the previous section. However, instead of changing the value of $\pi$ only,
we change $(\mu_1, \mu_0)$ to make the problem harder and harder and thus worsen the
optimal model’s performance. This can be done by reducing the distance between
the two normal distributions in (6), because this results in more overlap
between the classes and makes it harder to discriminate between them. As a
distance, we consider the KL-divergence, which boils down to $\frac{1}{2}(\mu_1 - \mu_0)^2$.
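
For completeness, the closed form follows from the two unit-variance densities in (6), with $p_1$ and $p_0$ denoting the class-conditional densities for $y = 1$ and $y = 0$:

```latex
\begin{align*}
KL(p_1 \,\|\, p_0)
  &= \mathbb{E}_{x \sim p_1}\left[\log p_1(x) - \log p_0(x)\right]
   = \frac{1}{2}\,\mathbb{E}_{x \sim p_1}\left[(x - \mu_0)^2 - (x - \mu_1)^2\right] \\
  &= \frac{1}{2}\left[\left(1 + (\mu_1 - \mu_0)^2\right) - 1\right]
   = \frac{1}{2}(\mu_1 - \mu_0)^2 .
\end{align*}
```

The second line uses $\mathbb{E}_{x \sim p_1}[(x - \mu_1)^2] = 1$ and $\mathbb{E}_{x \sim p_1}[(x - \mu_0)^2] = 1 + (\mu_1 - \mu_0)^2$.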


[Figure: performance measure plotted against $KL(p_1, p_0)$ (from 2.00 down to 0.00) for AUC-PR, AUC-P$_c$R, AUC-PR Gain, AUC-P$_c$R Gain, F1 score and F1 score$_c$.]

Fig. 5. Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated version as
$KL(p_1, p_0)$ tends to 0 and as $\pi$ randomly varies. This curve was obtained by averaging
results over 30 runs.


Figure 5 shows how the values of the metrics evolve as the KL-divergence
gets closer to zero. For each run, we randomly chose the prior $\pi$ in the interval
[0.001, 0.5]. As expected, all metrics globally decrease as the problem gets harder.
However, we can notice an important difference: the variations in the calibrated
metrics are smooth and monotonic compared to those of the original metrics,
which are affected by the random changes in $\pi$. In that sense, variations of the
calibrated metrics across the different generated datasets are much easier to
interpret than those of the original metrics.


4 Link Between Calibrated and Original Metrics 


4.1 Meaning of $\pi_0$


Let us first remark that for test datasets in which $\pi = \pi_0$, $Prec_c$ is equal to the
regular precision Prec since $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} = 1$ (this is observable in Fig. 4 with the
intersection of the metrics for $\pi = \pi_0 = 0.5$).

[Figure: AUC-PR obtained with the formula (dots) and with the heuristic mean (line).]

Fig. 6. Comparison between heuristic-based calibrated AUC-PR (red line) and our
closed-form calibrated AUC-PR (blue dots). The red shadow represents the standard
deviation of the heuristic-based calibrated AUC-PR over 1000 runs. (Color figure
online)


If $\pi \neq \pi_0$, the calibrated metrics essentially have the value that the original
ones would have if the positive class ratio $\pi$ was equal to $\pi_0$. To further demonstrate
that, we compare our proposal for calibration (5) with the only previous proposal
[10] that was designed for the same objective: a heuristic-based
calibration. The approach from [10] consists in randomly undersampling the test
set to make the positive class ratio $\pi$ equal to a chosen ratio (let us refer to it
as $\pi_0$ for the analogy) and then computing the regular metrics on the sampled
set. Because of the randomness, sampling may remove more hard examples than
easy examples so the performance can be over-estimated, and vice versa. To
avoid that, the approach performs several runs and computes a mean estimate.
In Fig. 6, we compare the results obtained with our formula and with their
heuristic, for several reference ratios $\pi_0$, on a highly unbalanced ($\pi = 0.0017$)
credit card fraud detection dataset available on Kaggle [4].

We can observe that our formula and the heuristic provide really close values.
This can be theoretically explained (see footnote 1) and confirms that our
formula really computes the value that the original metric would have if the
ratio $\pi$ in the test set was $\pi_0$. Note that our closed-form calibration (5) can be
seen as an improvement of the heuristic-based calibration from [10] as it directly
provides the targeted value without running a costly Monte-Carlo simulation.
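
The comparison can be reproduced in miniature for plain precision at a fixed threshold. This is a simplified sketch with hypothetical confusion counts, not the AUC-PR experiment of Fig. 6. It assumes $\pi_0 > \pi$, so that only negatives need to be undersampled, and the function names are illustrative.

```python
import random

def calibrated_precision(tp, fp, pi, pi0):
    """Closed-form calibrated precision of Eq. (5)."""
    weight = (pi * (1.0 - pi0)) / (pi0 * (1.0 - pi))
    return tp / (tp + weight * fp)

def heuristic_precision(n_tp, n_fn, n_fp, n_tn, pi0, runs=1000, seed=0):
    """Mean precision over random undersamplings of the negatives to reach ratio pi0."""
    rng = random.Random(seed)
    n_pos = n_tp + n_fn
    n_neg_kept = round(n_pos * (1.0 - pi0) / pi0)  # so that n_pos / (n_pos + kept) = pi0
    negatives = [1] * n_fp + [0] * n_tn            # 1 marks a false positive
    total = 0.0
    for _ in range(runs):
        fp_kept = sum(rng.sample(negatives, n_neg_kept))
        total += n_tp / (n_tp + fp_kept)
    return total / runs

# Hypothetical test set: 50 positives, 5000 negatives (pi is roughly 0.01).
n_tp, n_fn, n_fp, n_tn = 40, 10, 100, 4900
pi = (n_tp + n_fn) / (n_tp + n_fn + n_fp + n_tn)
closed_form = calibrated_precision(n_tp, n_fp, pi, pi0=0.2)
heuristic = heuristic_precision(n_tp, n_fn, n_fp, n_tn, pi0=0.2)
```

The closed form needs a single evaluation, while the heuristic averages over many resampled test sets to reach a comparable value.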


4.2 Do the Calibrated Metrics Rank Models in the Same Order 
as the Original Metrics? 


Calibration results in evaluating the metric for a different prior. In this section,
we analyze how this impacts the task of selecting the best model for a given
dataset. To do this, we empirically analyze the correlation of several metrics
in terms of model ordering. We use OpenML [16] to select the 614 supervised
binary classification datasets on which at least 30 models have been evaluated
with a 10-fold cross-validation. For each one, we randomly choose 30 models,
fetch their predictions, and evaluate their performance with the metrics. This
leaves us with 614 × 30 = 18,420 different values for each metric. To analyze
whether they rank the models in the same order, we compute the Spearman
rank correlation coefficient between them for the 30 models for each of the 614
problems.² Most datasets have roughly balanced classes ($\pi > 0.2$ in more than
90% of the datasets). Therefore, to also specifically analyze the imbalanced case,
we run the same experiment with only the subset of 4 highly imbalanced datasets
($\pi < 0.01$). The compared metrics are AUC-ROC, AUC-PR, AUC-PR Gain
and the best F1-score over all possible thresholds. We also add the calibrated
version of the last three. In order to understand the impact of $\pi_0$, we use two
different values: the arbitrary $\pi_0 = 0.5$ and another value $\pi_0 \approx \pi$ (for the first
experiment with all datasets, $\pi_0 \approx \pi$ corresponds to $\pi_0 = 1.01\pi$ and for the
second experiment where $\pi$ is very small, we go further and $\pi_0 \approx \pi$ corresponds
to $\pi_0 = 10\pi$, which remains closer to $\pi$ than 0.5). The obtained correlation
matrices are shown in Fig. 7. Each individual cell corresponds to the average
Spearman correlation over all datasets between the row metric and the column
metric.
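
The Spearman rank correlation between two metrics can be sketched in plain Python as the Pearson correlation of the rank vectors. The metric values below are hypothetical scores for five models, and the sketch assumes neither metric is constant across models (otherwise the denominator is zero).

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical scores of five models under two metrics.
auc_pr = [0.31, 0.45, 0.28, 0.52, 0.40]
auc_roc = [0.90, 0.93, 0.91, 0.95, 0.92]
rho = spearman(auc_pr, auc_roc)
```

Averaging such coefficients over all datasets for every pair of metrics yields one cell of a correlation matrix like those of Fig. 7.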


[Figure: two correlation matrices whose rows and columns are the compared metrics (AUC-PR and the other original and calibrated metrics).]

Fig. 7. Spearman rank correlation matrices between 10 metrics over 614 datasets for
the left figure and the 4 highly imbalanced datasets for the right figure.


A general observation is that most metrics are less correlated with each other
when classes are unbalanced (right matrix in Fig. 7). We also note that the best
F1-score is more correlated to AUC-PR than to AUC-ROC or AUC-PR Gain. In
the balanced case (left matrix in Fig. 7), we can see that metrics defined as area
under curves are generally more correlated with each other than with the
threshold-sensitive classification metric F1-score. Let us now analyze the impact of
calibration. As expected, in general, when $\pi_0 \approx \pi$, calibrated metrics have a behavior
really close to that of the original metrics because $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \approx 1$ and therefore
$Prec_c \approx Prec$. In the balanced case (left), since $\pi$ is close to 0.5, calibrated metrics
with $\pi_0 = 0.5$ are also highly correlated with the original metrics. In the imbalanced
case (right matrix in Fig. 7), when $\pi_0$ is arbitrarily set to 0.5 the calibrated
metrics seem to have a low correlation with the original ones. In fact, they are less
correlated with them than with AUC-ROC. This makes sense given the relative
weights that each of the metrics applies to FP and TP. The original precision
gives the same weight to TP and FP, although false positives are $\frac{1-\pi}{\pi}$ times more
likely to occur ($\frac{1-\pi}{\pi} > 99$ if $\pi < 0.01$). The calibrated precision with the arbitrary
value $\pi_0 = 0.5$ boils down to $\frac{TP}{TP + \frac{\pi}{1-\pi} FP}$ and gives a weight $\frac{1-\pi}{\pi}$ times
smaller to false positives, which counterbalances their higher likelihood. ROC, like
the calibrated metrics with $\pi_0 = 0.5$, gives $\frac{1-\pi}{\pi}$ times less weight to FP because it is
computed from FPR and TPR, which are linked to TP and FP by the relationship
$\frac{\pi}{1-\pi} \frac{FP}{TP} = \frac{FPR}{TPR}$.

² The implementation of the paper experiments can be found at https://github.com/wissam-sib/calibrated_metrics.

To sum up the results, we first emphasize that the choice of the metrics to
rank classifiers when datasets are rather balanced seems to be much less sensitive
than in the extremely imbalanced case. In the balanced case the least correlated
metrics have an average rank correlation of 0.81. For the imbalanced datasets,
on the other hand, many metrics have low correlations, which means that they
often disagree on the best model. The choice of the metric is therefore very
important here. Our experiment also seems to reflect that rank correlations are
mainly a matter of how much weight is given to each type of error. Choosing
these “weights” generally depends on the application at hand, and this should be
remembered when using calibration. To preserve the nature of a given metric,
$\pi_0$ has to be fixed to a value close to $\pi$ and not arbitrarily. The user still has the
choice to fix it to another value if the purpose is specifically to place the results
into a different reference with a different prior.


5 Guidelines and Use-Cases 


Calibration could benefit ML practitioners when analyzing the performance of a
model across different datasets/time periods. Without being exhaustive, we give
four use-cases where it is beneficial (setting $\pi_0$ depends on the target use-case):


Comparing the Performance of a Model on Two Populations/Classes:
Consider a practitioner who wants to predict patients with a disease and evaluate
the performance of the model on subpopulations of the dataset (e.g. children,
adults and elderly people). If the prior is different from one population to another
(e.g. elderly people are more likely to have the disease), precision will be affected,
i.e. a population with a higher disease ratio will be more likely to have a higher
precision. In this case, the calibrated precision can be used to obtain the precision
of each population set to the same reference prior (for instance, $\pi_0$ can be
chosen as the average prior over all populations). This would provide an additional
balanced point of view and make the analysis richer to draw more precise
conclusions and perhaps study fairness [1].


Model Performance Monitoring in an Industrial Context: In systems
where a model’s performance is monitored over time with precision-based metrics
like F1-score, using calibration in addition to the regular metrics makes it
easier to understand the evolution, especially when the class prior can evolve (cf.
application in Fig. 1). For instance, it can be useful to analyze the drift (i.e. distinguish
between variations linked to $\pi$ or $P(x \mid y)$) and design adapted solutions:
either updating the threshold or completely retraining the model. To avoid distorting
the F1-score too much, here $\pi_0$ has to be fixed based on realistic values
(e.g. average $\pi$ in historical data).


Establishing Agreements with Clients: As shown in previous sections, $\pi_0$
can be interpreted as the ratio to which we refer to compute the metric. This
can be useful to establish a guarantee, in an agreement, that will be robust
to uncontrollable events. Indeed, if we take the case of fraud detection, the
real positive class ratio $\pi$ can vary extremely from one day to another and on
particular events (e.g. fraudster attacks, holidays), which significantly affects the
measured metrics (see Fig. 4). Here, after both parties have agreed beforehand
on a reasonable value for $\pi_0$ (based on their business knowledge), calibration will
always compute the performance relative to this ratio and not the real $\pi$, and
the result will thus be easier to guarantee.


Anticipating the Deployment of a Model in Production: Imagine one
collects a sample of data to develop an algorithm and reaches an acceptable AUC-PR
for production. If the prior in the collected data is different from reality,
the non-calibrated metric might have given either a pessimistic or optimistic
estimation of the post-deployment performance. This can be extremely harmful
if the production has strict constraints. Here, if the practitioner uses calibration
with $\pi_0$ equal to the minimal prior envisioned for the application at hand, they
would be able to anticipate the worst-case scenario.
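
The monitoring and deployment use-cases can be sketched with the calibrated F1-score, obtained by replacing Prec with the calibrated precision of (5) in (2). The daily confusion counts are hypothetical: the model behaves identically every day (same TPR and FPR) and only the number of positives changes.

```python
def f1(tp, fn, fp, tn):
    """Regular F1-score of Eq. (2)."""
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2.0 * prec * rec / (prec + rec)

def calibrated_f1(tp, fn, fp, tn, pi0):
    """F1-score of Eq. (2) with Prec replaced by the calibrated precision of Eq. (5)."""
    pi = (tp + fn) / (tp + fn + fp + tn)
    weight = (pi * (1.0 - pi0)) / (pi0 * (1.0 - pi))
    prec_c = tp / (tp + weight * fp)
    rec = tp / (tp + fn)
    return 2.0 * prec_c * rec / (prec_c + rec)

# Hypothetical daily counts: TPR = 0.8 and FPR = 0.02 every day, the prior shrinks.
days = [
    dict(tp=80, fn=20, fp=200, tn=9800),
    dict(tp=40, fn=10, fp=200, tn=9800),
    dict(tp=8, fn=2, fp=200, tn=9800),
]
priors = [(d["tp"] + d["fn"]) / sum(d.values()) for d in days]
pi0 = sum(priors) / len(priors)  # reference prior: average historical prior
regular = [f1(**d) for d in days]
calibrated = [calibrated_f1(pi0=pi0, **d) for d in days]
worst_case = [calibrated_f1(pi0=min(priors), **d) for d in days]
```

The regular F1-score drops from day to day although the model has not changed, while the calibrated version stays constant. Setting the reference to the smallest envisioned prior gives a lower, worst-case figure.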


6 Conclusion 


In this paper, we provided a calibration formula, empirical results, and guidelines
to make the values of metrics across different datasets more interpretable.
Calibrated metrics are a generalization of the original ones. They rely on a
reference $\pi_0$ and compute the value that we would obtain if the positive class
ratio $\pi$ in the evaluated test set was equal to $\pi_0$. If the user chooses
$\pi_0 = \pi$, this does not change anything and the regular metrics are retrieved.
But, with different choices, the metrics can serve several purposes, such as obtaining
robustness to variation in the class prior across datasets, or anticipation. They are
useful in both academic and industrial applications, as explained in the previous
section: they help to draw more accurate comparisons between subpopulations, or to
study incremental learning on streams by providing a point of view agnostic to virtual
concept drift [17]. They can be used to provide more controllable performance indicators
(easier to guarantee and report), help to prepare deployment in production, and
prevent false conclusions about the evolution of a deployed model. However, $\pi_0$
has to be chosen with caution, as it controls the relative weights given to FP and
TP and, consequently, can affect the selection of the best classifier.


Supervised Phrase-Boundary Embeddings


Abstract. We propose a new word embedding model, called SPhrase,
that incorporates supervised phrase information. Our method modifies 
traditional word embeddings by ensuring that all target words in a phrase 
have exactly the same context. We demonstrate that including this infor- 
mation within a context window produces superior embeddings for both 
intrinsic evaluation tasks and downstream extrinsic tasks. 


Keywords: Phrase embeddings · Named entity recognition · Natural
language processing


1 Introduction 


Word embeddings represent words with multidimensional vectors that are used
in various models for applications such as named entity recognition [9], query
expansion [13], and sentiment analysis [21]. These embeddings are usually
generated from a huge corpus with unsupervised learning models [3, 16, 18, 23, 24].
These models describe target words by their neighbouring words, which are
treated as contexts. The selection of these context words is generally linear
(i.e. the n words surrounding the target). Alternatively, arbitrary context words
were used in [16], where context selection is based on the syntactic
dependencies to the target word.

These models treat words as lexical units and create a context window sur- 
rounding a target word. This approach can be problematic when the context 
window for a target word contains only part of a phrase. For example, consider 
a scenario where a target word is close to (and to the right of) the named entity 
“George W. Bush” but the context window only retains the word “George”. 
Clearly this will generate ambiguity, as the independent word “George” may
refer to another person (George Washington), a location (George Street, Oxford) or
a music band (George). To deal with this issue, [19] used a data-driven
approach to identify these phrases and treat them as individual tokens. While
this technique may learn a phrase representation, it cannot learn a representation
of the individual words that comprise the phrase.

In our approach we obtain phrase information directly from Wikipedia. Terms
from Wikipedia articles are formatted as hyperlinks to relevant articles. In a
related method [22] these terms are extracted as named entities. This paper
interprets these terms as phrases. By using Wikipedia for phrase information
(unlike [16]) we avoid needing additional grammatical information. This also
gives us the potential to generate multi-lingual embeddings, although we do not
pursue this here.

In this work, we use phrase boundary information to generate word
embeddings in a non-compositional manner, rather than phrase embeddings. We
consider each of the words in a phrase as part of a unit, where a unit
can be either a single word (i.e. not a link in Wikipedia) or otherwise a bag
of words. The embeddings are then learned for each of the unit members by
considering surrounding units in the context.

In the following section we present related work in this domain. Section 3
presents our model, and in Sects. 4 to 6 we give details of the implementation
and the experiments.


2 Related Work 


Word representations can be obtained from a language model where the goal is
to predict a future word based on some previously observed information such
as a sentence, a sequence, or a phrase. Various models can be used for this
task, including joint probabilities of observations, which may incorporate the Markov
assumption. Under this assumption, the immediate future is
independent of the entire past given the present. N-gram language models [4]
use this assumption to predict token(s) using the previous N − 1 tokens [17].
This can be constructed efficiently for very large datasets using neural network
based language modelling (NNLM) [2].
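
As an illustration of the Markov assumption for N = 2, the following sketch estimates bigram probabilities by simple counting. It is only a toy maximum-likelihood estimator under that assumption, not the neural approach of [2]; the function names are hypothetical.

```python
from collections import Counter


def train_bigram(tokens):
    """Collect the counts needed for a maximum-likelihood bigram model (N = 2)."""
    context_counts = Counter(tokens[:-1])              # how often each word has a successor
    bigram_counts = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs
    return context_counts, bigram_counts


def bigram_prob(prev, word, context_counts, bigram_counts):
    """P(word | prev): the next token depends only on the previous N - 1 = 1 token."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


tokens = "the cat sat on the mat".split()
context_counts, bigram_counts = train_bigram(tokens)
p_cat_given_the = bigram_prob("the", "cat", context_counts, bigram_counts)
```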

The NNLM of [2] used a non-linear hidden layer between the input and 
output layers. A simpler network named the log bi-linear model was introduced 
in [20] by dropping the hidden layer between input and output layer. Instead 
of the hidden layer, context vectors were summed and projected to the output 
layer. This model was later used by [18] and named CBOW (Continuous Bag- 
of-words model), with a symmetric context (i.e. context words on both sides of 
the target word). 

In addition, the Skip-gram model was introduced in the same work [18] by reversing
CBOW to predict the context from the target word. Given a context range c and a
target word $w_t$, the objective is to maximise the average log probability,

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t),$$

where T is the number of words in the training sequence. The model defines
$p(w_{t+j} \mid w_t)$ using the softmax function,

$$p(w_{t+j} \mid w_t) = \frac{\exp\left({v'_{w_{t+j}}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_t}\right)},$$

where $v_w$ and $v'_w$ are the “input” and “output” vector representations of w, and
W is the number of words in the vocabulary. However, due to the large vocabulary,
the computation becomes impractical. Thus, Noise Contrastive Estimation
(NCE) [7] was used, which approximates the same operation by sampling a very small
number of words k from the vocabulary as noise.
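
The Skip-gram objective with a full softmax can be sketched as follows. This is a minimal NumPy illustration, assuming words are represented by integer ids and that the input and output vectors are stored as rows of two matrices; it computes the exact softmax and therefore shows precisely the cost that sampling-based methods such as NCE avoid. All names are illustrative.

```python
import numpy as np


def skipgram_softmax(v_in, V_out):
    """p(w | target) for every vocabulary word w, given the target's input vector."""
    scores = V_out @ v_in            # one dot product per vocabulary word
    scores = scores - scores.max()   # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


def average_log_probability(token_ids, c, V_in, V_out):
    """Average over positions t of the summed log p(w_{t+j} | w_t), -c <= j <= c, j != 0."""
    T = len(token_ids)
    total = 0.0
    for t, target in enumerate(token_ids):
        probs = skipgram_softmax(V_in[target], V_out)
        for j in range(-c, c + 1):
            if j == 0 or not (0 <= t + j < T):
                continue  # skip the target itself and positions outside the sequence
            total += np.log(probs[token_ids[t + j]])
    return total / T


# With all-zero vectors every word is equally likely, so each term is log(1/W).
W, dim = 3, 2
V_in, V_out = np.zeros((W, dim)), np.zeros((W, dim))
objective = average_log_probability([0, 1, 2], c=1, V_in=V_in, V_out=V_out)
```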

A similar technique is Candidate Sampling [10], which combines noise samples
with the true class into a set S, with the objective of predicting the true
class from this set, where Y is a set of true classes. Embeddings are scored as

$$Y_s = (X_s \cdot W_s + b_s) - \log(E(s)),$$

where $X_s$ is the vector (embedding) corresponding to a word $s \in S$, $W_s$ is
the corresponding weight, $b_s$ is the bias, and E(s) is the expectation for s. Each
score is converted to a probability using the softmax function,

$$\mathrm{Softmax}(Y_s) = \frac{\exp Y_s}{\sum_{s' \in S} \exp Y_{s'}}.$$

In addition to words, phrases may also be considered. In [18], the words
comprising a phrase were joined using the delimiter ‘_’ between them, and their
joint embedding was learned. This scheme is called non-compositional embedding
[8, 26]. Alternatively, compositional embeddings [8] are generated by merging the
word embeddings of phrase components using a composition function. The main
difference between these schemes is that the former learns the phrase embeddings,
while the latter merely merges already learned word embeddings to form the phrase
embeddings. Similarly, [3] introduced an extension of the Skip-gram model [18]
that composes sub-word embeddings to make word embeddings, with summation
as the composition function.


3 The SPhrase Model 


The proposed model uses information about which words belong to which
phrases. This information can be conveniently represented simply as the
locations where phrases start and end, hence the name Supervised Phrase Boundary
Representations model (SPhrase).

The key assumption is that each word that comprises a phrase has the same
context. This will produce an embedding where words that occur in the same
phrase are likely to be close in the vector space. For example, consider the
sentence:


British Airways to New York has Departed 

This sentence includes the (noun) phrase ‘New York’. Following the procedure
for Word2vec, we focus on the target word ‘New’ using a context window of
size 1. The (target, context) pairs are (New, to) and (New, York). Repeating this
procedure for the target word ‘York’ yields the (target, context) pairs (York, New)
and (York, has).

For SPhrase, the context differs from Word2vec: both target words in ‘New
York’ have the same context, based on the words immediately surrounding
the phrase. Hence the SPhrase (target, context) pairs are (New, to), (New, has),
(York, to), (York, has). Figure 1 highlights the context words for the word ‘New’
for both Word2vec and SPhrase.


Word2vec:  British airways [to] New [York] has departed

SPhrase:   British airways [to] New York [has] departed


Fig. 1. Context words for the target word New using Word2vec and SPhrase. The
context words are shown in square brackets. The context size is 1.


In the above, we demonstrated the (target, context) pairs induced by a target
word that is a member of a phrase and whose context consists of individual words.
In the following, we generalise the approach to handle the situation where phrases
are part of a context. We do this by introducing the concept of a unit, where a unit
consists of a sequence of words. A unit of length 1 represents an individual word, a
unit of length 2 represents a two-word phrase, and so on for larger phrases.

Thus we measure the context simply in terms of units. Figure 2 provides an
example of a context of size 2 on each side. Note that the left context for SPhrase
contains 3 words. Thus the context size measured in words will be larger for
SPhrase than for Word2vec if there is a phrase within the context window.


Word2vec:  British [airways] [to] Rome [has] [departed]

SPhrase:   [British airways] [to] Rome [has] [departed]


Fig. 2. Context words for the target word Rome using Word2vec and SPhrase. The
context words are shown in square brackets. The context size is 2.
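
The difference between the two ways of forming (target, context) pairs can be sketched in a few lines of Python. This is a minimal illustration of the pairing rule described above, not the authors' implementation; a sentence is assumed to be given either as a flat token list (Word2vec) or as a list of units, each unit being a list of words (SPhrase). The function names are hypothetical.

```python
def word2vec_pairs(tokens, window):
    """(target, context) pairs with the context measured in words."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def sphrase_pairs(units, window):
    """(target, context) pairs with the context measured in units.

    Every word of a unit receives the same context: all words of the
    units that lie within `window` units of it.
    """
    pairs = []
    for i, unit in enumerate(units):
        lo, hi = max(0, i - window), min(len(units), i + window + 1)
        context = [w for j in range(lo, hi) if j != i for w in units[j]]
        pairs.extend((target, c) for target in unit for c in context)
    return pairs


units = [["British", "Airways"], ["to"], ["New", "York"], ["has"], ["Departed"]]
tokens = [w for unit in units for w in unit]

w2v = [p for p in word2vec_pairs(tokens, 1) if p[0] in ("New", "York")]
sph = [p for p in sphrase_pairs(units, 1) if p[0] in ("New", "York")]
```

For the example sentence with a window of size 1, `w2v` holds the pairs (New, to), (New, York), (York, New), (York, has), whereas `sph` holds (New, to), (New, has), (York, to), (York, has), as in the text.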
3.1 SPhrase Context Sampling


A standard approach to reduce the computation involved in generating embeddings
is to shorten the effective context length by using only a sample of words
from a context [18]. For SPhrase this can be achieved in several ways. First, it
can be done at the level of units rather than words; this is denoted unit context
sampling (SPhrase). Second, random word context sampling (R)¹ involves first
performing unit context sampling; then, for each unit that has a length greater than
one, only one word is sampled uniformly at random. This yields an effective context
length that matches the context length of Word2vec. In addition, we generate
embeddings without unit context sampling (NU), where the target is still
a unit but the context comprises individual words.
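
The three context regimes can be contrasted with a small sketch. It is an interpretation of the description above under stated assumptions: the Word2vec-style subsampling of the window itself is left out, and for NU the context is assumed to be the `window` words on each side of the target unit. Function names are hypothetical and `window` is assumed to be at least 1.

```python
import random


def context_units(units, i, window):
    """Units within `window` units of the target unit i."""
    lo, hi = max(0, i - window), min(len(units), i + window + 1)
    return [units[j] for j in range(lo, hi) if j != i]


def unit_context(units, i, window):
    """SPhrase: every word of every context unit."""
    return [w for unit in context_units(units, i, window) for w in unit]


def random_word_context(units, i, window, rng=random):
    """R: one word per context unit, drawn uniformly at random."""
    return [rng.choice(unit) for unit in context_units(units, i, window)]


def word_context(units, i, window):
    """NU (assumed reading): the target is a unit, the context is counted in words."""
    left = [w for unit in units[:i] for w in unit][-window:]
    right = [w for unit in units[i + 1:] for w in unit][:window]
    return left + right


units = [["British", "airways"], ["to"], ["Rome"], ["has"], ["departed"]]
full = unit_context(units, 2, 2)
sampled = random_word_context(units, 2, 2, rng=random.Random(0))
words_only = word_context(units, 2, 2)
```

For the target Rome with a window of 2, `full` contains five words, while `sampled` and `words_only` each contain four, which is the Word2vec context length.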


4 Methods and Datasets 


4.1 Dataset 


In order to generate an embedding using our approach, we require a corpus in
which phrases are annotated. Unfortunately, such a corpus is not readily available,
so we use a proxy for phrase annotation. In datasets that include hyperlinks, we
assume that the displayed text of a hyperlink is a phrase. One such dataset is
Wikipedia; we use the English Wikipedia dump version 20180920, which contains over
3 billion tokens. The proportion of tokens in phrases of length 2 is 2.5%; for
lengths 3, 4, 5, and greater it is 0.8%, 0.3%, 0.2%, and less than 0.1%, respectively.
Obviously, not all phrases are represented as hyperlink text and not all hyperlink
texts are phrases. Indeed, the longest hyperlink text in our dataset has length 16,382
(it included internal Wikipedia formatting). For our study, we restricted the maximum
phrase length to 10. The embedding vocabulary contains tokens with a frequency of at
least 100, which gave us a total of 400,919 distinct tokens.


4.2 Parameter Settings 


Training is performed in mini-batches of 60,000 tokens per batch, with candidate
sampling of 5000 classes per batch (a value dictated by the available computational
resources). The remaining parameters use standard values: the learning rate is
initialised to 0.001 and optimisation uses the Adam optimiser [12] for stochastic
learning. The learning rate decay is set to 10% (i.e. the learning rate is multiplied
by 0.9) after each epoch. The total number of epochs is set to 20. The weighting
scheme for selecting words in the context sampling is the same as for Word2vec [18].
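
The decay schedule described above amounts to a simple geometric sequence, sketched here as a small worked example. The function name is illustrative, and epochs are assumed to be counted from 0, so that the first epoch uses the initial rate.

```python
def learning_rate(epoch, initial=0.001, decay=0.9):
    """Learning rate in effect during `epoch` (0-based): initial * decay ** epoch."""
    return initial * decay ** epoch


# One value per epoch for the 20 epochs of training.
schedule = [learning_rate(epoch) for epoch in range(20)]
```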


5 Evaluation 


There are two commonly accepted types of evaluation task: intrinsic and extrinsic.
Intrinsic evaluation tasks determine the quality of embeddings. Under this
class, word similarity/relatedness tasks are generally based on cosine distance as
a metric to find the similarity between two word vectors. Extrinsic evaluation tasks,
on the other hand, are based on specific downstream tasks such as named entity
recognition (NER), sentiment classification, and topic detection. In this work, we
perform similarity-based intrinsic evaluation and NER-based extrinsic evaluation.


¹ Pretrained embeddings are available at: https://github.com/ManniSingh/SPhrase.


6 Experimental Design 


6.1 Intrinsic Evaluation 


The following experiments fit into the so-called intrinsic category of embedding
evaluation. We aim to demonstrate that although the total number of phrases in
our dataset is small compared to the number of words, they do have a positive
impact on the resulting embeddings. In order to determine an optimal configuration
of the method, intrinsic evaluation is done on embeddings trained on the
first 10% of the corpus; see Fig. 3. The optimal configuration in this evaluation
is SPhrase (R) with window size 5. For the extrinsic evaluation described in
Sect. 6.2, only the optimal configuration is used and the embeddings are trained
on the full corpus.

In the following experiments we compare SPhrase embeddings with the ones 
generated by Word2vec. It is known that increasing the context window size gen- 
erally improves the quality of the embedding. Recall that the expected context 
size for each target word is the same for Word2vec and SPhrase due to word 
context sampling. 

We expect that words in phrases should be mapped to similar locations in
the embedding, i.e. words within a phrase should be closer together than words
that are not in the same phrase. In the following, we first perform experiments
on pairwise similarity and then investigate further structure with an analogy
task.


Pairwise Similarity. For pairwise similarity experiments we use phrases from 
three datasets. 


- CoNLL-2003 English dataset [25]. From this dataset, multi-word named entities
  were extracted. These are used as phrases; in total there are 12,999. The
  maximum phrase length is 7 in this dataset, so we restricted the following
  two datasets to this as well.

- From our Wikipedia training corpus we obtained 16,470 phrases from the first
  1,000,000 tokens. This dataset comes from our training data, so we assume we
  should obtain good results in this case.

- Bristol [15]. From this dataset we used only the entity list and found
  87,209 phrases.

Fig. 3. Comparison of similarity scores for the phrases relative to 100 random words
under three regimes: unit context sampling (SPhrase), without unit context sampling
(NU), and with random word context sampling (R). SPhrase (in bold) and Word2vec
(dashed) are compared on phrase lengths 2-7 (horizontal axis); the higher the score,
the better the performance.

In order to investigate how the distances between words within a phrase compare to
their distances to random words in the datasets, we use the following,

$$\text{Similarity Score} = \frac{1}{N-1}\sum_{i=1}^{N-1} b(w_i, w_{i+1}, r),$$

where,

$$b(w_i, w_{i+1}, r) = \begin{cases} 0 & s(w_i, w_{i+1}) > s(w_i, r), \\ -1 & \text{otherwise,} \end{cases}$$

where r is a word selected at random from another phrase. A new word is drawn
for each phrase pair comparison. The similarity score is calculated 100 times and
the overall average is taken in order to reduce the noise generated by selecting
only one word for each comparison. The interpretation of this is similar to the
cosine score in that the larger the value the better.

We computed scores for phrase lengths up to and including length 7. We
used context window sizes 3, 5 and 10. Figure 3 shows these scores for
the context sampling regimes: with unit context sampling, without unit context
sampling, and word context sampling.

We can see that regardless of the embedding, the scores in general reduce as 
the phrase gets longer. However, the larger the window size the more Word2vec 
and SPhrase agree. This is what we should expect, since there will be greater 
overlap in the context words between SPhrase and Word2vec. Nevertheless we 
see that, overall, SPhrase performs better. 


Google Analogy Test Set. Analogy-based tasks are widely used, e.g. [5, 6, 11],
to evaluate the quality of word embeddings. One well-known test set is the
Google analogy test set [18]. This dataset comprises rows of four words, such
as known unknown informed uninformed. The analogy task is to predict the
final word from the first three using simple vector addition/subtraction of their
vector representations. Informally, the task attempts to show how well words
follow the vector relationship

unknown - known = uninformed - informed
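
The vector relationship above is usually turned into a prediction by a nearest-neighbour search, sketched below. This is a minimal illustration with NumPy; the exact scoring used in the experiments is not stated in the text, so ranking by cosine similarity and excluding the three query words are assumptions, and the toy vectors are made up for the example.

```python
import numpy as np


def solve_analogy(a, b, c, embeddings):
    """Return the word d that best satisfies b - a = d - c.

    Candidates are ranked by cosine similarity to b - a + c;
    the three query words are excluded from the candidates.
    """
    query = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Hypothetical 3-dimensional embeddings in which the middle axis encodes negation.
toy = {
    "known": np.array([1.0, 0.0, 0.0]),
    "unknown": np.array([1.0, 1.0, 0.0]),
    "informed": np.array([0.0, 0.0, 1.0]),
    "uninformed": np.array([0.0, 1.0, 1.0]),
    "happy": np.array([1.0, 0.0, 1.0]),
}
prediction = solve_analogy("known", "unknown", "informed", toy)
```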


Table 1. Scores on the Google analogy dataset with unit context sampling (SPhrase).
Accuracy is the number of correct predictions divided by the total number of instances,
displayed to 3 decimal places.

                               Window size 3       Window size 5       Window size 10
Category                       SPhrase  Word2vec   SPhrase  Word2vec   SPhrase  Word2vec   Count
capital-world                  0.727    0.628      0.746    0.658      0.815    0.782       4524
capital-common-countries       0.872    0.848      0.941    0.856      0.976    0.941        506
city-in-state                  0.660    0.480      0.715    0.583      0.647    0.677       2467
gram3-comparative              0.848    0.806      0.758    0.813      0.643    0.670       1332
gram2-opposite                 0.223    0.220      0.220    0.222      0.206    0.204        812
gram8-plural                   0.755    0.736      0.715    0.744      0.641    0.727       1332
gram4-superlative              0.379    0.396      0.345    0.366      0.279    0.262       1122
gram9-plural-verbs             0.639    0.559      0.536    0.546      0.453    0.521        870
gram6-nationality-adjective    0.846    0.784      0.838    0.815      0.854    0.853       1599
family                         0.603    0.595      0.595    0.638      0.581    0.543        506
gram7-past-tense               0.472    0.515      0.474    0.492      0.441    0.470       1560
currency                       0.047    0.042      0.021    0.021      0.018    0.016        866
gram1-adjective-to-adverb      0.104    0.087      0.119    0.121      0.132    0.148        992
gram5-present-participle       0.517    0.520      0.509    0.486      0.479    0.455       1056
all                            0.601    0.545      0.597    0.565      0.581    0.587      19544

Table 2. Scores on the Google analogy dataset without unit context sampling (NU).
Accuracy is the number of correct predictions divided by the total number of instances,
displayed to 3 decimal places.

                               Window size 3       Window size 5       Window size 10
Category                       SPhrase  Word2vec   SPhrase  Word2vec   SPhrase  Word2vec   Count
capital-world                  0.671    0.628      0.725    0.658      0.744    0.782       4524
capital-common-countries       0.881    0.848      0.935    0.856      0.929    0.941        506
city-in-state                  0.653    0.480      0.645    0.583      0.652    0.677       2467
gram3-comparative              0.706    0.806      0.696    0.813      0.519    0.670       1332
gram2-opposite                 0.217    0.220      0.197    0.222      0.172    0.204        812
gram8-plural                   0.726    0.736      0.712    0.744      0.661    0.727       1332
gram4-superlative              0.273    0.396      0.298    0.366      0.269    0.262       1122
gram9-plural-verbs             0.577    0.559      0.548    0.546      0.477    0.521        870
gram6-nationality-adjective    0.855    0.784      0.821    0.815      0.827    0.853       1599
family                         0.569    0.595      0.553    0.638      0.502    0.543        506
gram7-past-tense               0.453    0.515      0.483    0.492      0.414    0.470       1560
currency                       0.039    0.042      0.024    0.021      0.028    0.016        866
gram1-adjective-to-adverb      0.130    0.087      0.173    0.121      0.168    0.148        992
gram5-present-participle       0.511    0.520      0.509    0.486      0.492    0.455       1056
all                            0.565    0.545      0.576    0.565      0.553    0.587      19544


The dataset is divided into categories, some of which are inherently phrase-based.
In the category capital-common-countries a typical line is:

Athens Greece Baghdad Iraq

Both Athens Greece and Baghdad Iraq can be reasonably construed to be phrases,
unlike in the first example above. Two other categories have this same character,
namely capital-world and city-in-state. Example rows are
Athens Greece Canberra Australia and Chicago Illinois Houston Texas, respectively.

With this in mind we show the accuracy of SPhrase and Word2vec stratified 
by category, in addition to the overall accuracy that is usually reported. The 
categories that have a phrasal quality are italicised in Tables 1, 2 and 3. We see 
that, overall, SPhrase performs better in these categories. 


6.2 Extrinsic Evaluation 


We use Conll2003 English [25] and Wikigold [1] to evaluate the performance of
the generated embeddings. The Conll dataset is widely used to evaluate
NER models. It contains 203,621 tokens in the training set, while the validation
and test sets contain 51,362 and 46,435 tokens respectively. Wikigold, on the other
hand, provides a single data file of 39,007 tokens, which we used for testing,
while the NER models were trained with the Conll training and validation data.
We used the SPhrase (R) model with window size 5, since this configuration
demonstrated significant improvements over Word2vec, as shown in Fig. 3. We recreated
the BLSTM and CRF based model [14], but without any feature engineering.

Table 3. Scores on the Google analogy dataset with random word context sampling (R).
Accuracy is the number of correct predictions divided by the total number of instances,
displayed to 3 decimal places.

                               Window size 3       Window size 5       Window size 10
Category                       SPhrase  Word2vec   SPhrase  Word2vec   SPhrase  Word2vec   Count
capital-world                  0.637    0.628      0.718    0.658      0.766    0.782       4524
capital-common-countries       0.858    0.848      0.903    0.856      0.953    0.941        506
city-in-state                  0.664    0.480      0.623    0.583      0.663    0.677       2467
gram3-comparative              0.845    0.806      0.803    0.813      0.682    0.670       1332
gram2-opposite                 0.224    0.220      0.245    0.222      0.196    0.204        812
gram8-plural                   0.772    0.736      0.731    0.744      0.655    0.727       1332
gram4-superlative              0.373    0.396      0.392    0.366      0.257    0.262       1122
gram9-plural-verbs             0.575    0.559      0.586    0.546      0.474    0.521        870
gram6-nationality-adjective    0.818    0.784      0.824    0.815      0.831    0.853       1599
family                         0.615    0.595      0.581    0.638      0.595    0.543        506
gram7-past-tense               0.479    0.515      0.520    0.492      0.460    0.470       1560
currency                       0.040    0.042      0.024    0.021      0.023    0.016        866
gram1-adjective-to-adverb      0.090    0.087      0.127    0.121      0.172    0.148        992
gram5-present-participle       0.526    0.520      0.455    0.486      0.479    0.455       1056
all                            0.576    0.545      0.588    0.565      0.576    0.587      19544


Table 4. Comparison of Word2vec with SPhrase(NU) on the Conll2003 English and
Wikigold datasets

Model       Conll2003Eng       Wikigold
Word2Vec    83.82 ± 0.3831     55.49 ± 0.4708
SPhrase     88.93 ± 0.1115     66.01 ± 0.4172


We trained this model for 20 epochs, evaluating on the validation data after each
epoch. We ran each of these models 10 times and report the range of F1
scores (computed with the Conll2003 evaluation script). Table 4 displays the results,
which show a significant improvement over the Word2vec model trained on the same corpus.


7 Concluding Remarks 


This investigation demonstrates that using phrasal information can directly
enrich word embeddings. In this work, we presented an alternative context sampling
technique to that used in skip-gram Word2vec. We note that the SPhrase
approach is not limited to augmenting Word2Vec; it can also be applied to
morphological extensions such as Fasttext [3].

We used the displayed text from hyperlinks as a proxy for phrases, and in 
this sense SPhrase is supervised. We are, however, planning to generalise the 
methodology by investigating whether we can identify useful phrase boundaries 
in a completely unsupervised fashion. 


Abstract. Prognostics is the area of research that is concerned with
predicting the remaining useful life of machines and machine parts. The 
remaining useful life is the time during which a machine or part can 
be used, before it must be replaced or repaired. To create accurate pre- 
dictions, predictive techniques must take external data into account on 
the operating conditions of the part and events that occurred during its 
lifetime. However, such data is often not available. Similarity-based tech- 
niques can help in such cases. They are based on the hypothesis that if a 
curve developed similarly to other curves up to a point, it will probably 
continue to do so. This paper presents a novel technique for similarity- 
based remaining useful life prediction. In particular, it combines Bayesian 
updating with priors that are based on similarity estimation. The paper 
shows that this technique outperforms other techniques on long-term 
predictions by a large margin, although other techniques still perform 
better on short-term predictions. 


Keywords: Remaining useful life · Trajectory based similarity
prediction · Bayesian updating · Similarity estimation · Prognostics ·
Prediction


1 Introduction 


Prognostics is the area of research that concerns the prediction of the remaining
useful life (RUL) of machines or machine parts. A RUL prediction is a prediction
of the time until a machine or machine part must be replaced or repaired. It is
important that such predictions are accurate: early predictions lead to unnecessarily
frequent maintenance with associated costs, while late predictions increase
the risk of a machine breakdown with associated loss of production time and
possibly sales.

Data-driven RUL prediction is based on run to failure data, i.e., observations
on what happened to a part or machine in a run from the last maintenance
activity to the next. Figure 1 shows a typical example of run to failure data,
in this case data of a filter in a chemical plant. The figure shows condition
measurements on the filter over time, in terms of the difference in pressure before
and after the filter. It shows that this difference is close to zero for some time.
Then, the filter starts to clog up and the pressure builds up, until the filter is
replaced and the pressure difference returns to normal. The resulting ‘sawtooth’
shape is frequently observed in run to failure data.


[Figure: differential pressure (bar) plotted against month, from January to the
following January.]

Fig. 1. Example run to failure data.


RUL prediction on run to failure data can be done by fitting a model, such 
as a regression model or a probability distribution, on the data. Many differ- 
ent techniques exist for those purposes [1]. However, as is evident from Fig. 1, 
different runs may have very different durations or shapes, and RUL prediction 
techniques rely on additional data to accurately predict the duration and shape 
of a particular run. Unfortunately, additional data is often unavailable or hard to 
relate to the run to failure data [2]. If additional data is unavailable, it is unclear 
which condition measurements are reliable and of course what their influence 
is on the RUL. One way to overcome these problems is to use similarity-based 
techniques, which work based on the hypothesis that, if a curve has developed 
similarly to some collection of other curves until now, it will likely continue to 
develop like that, and have a similar remaining useful life. 

This paper explores the performance of two similarity-based techniques: 
trajectory-based similarity prediction, and Bayesian updating. It then adds its 
own: Bayesian updating with similarity-based priors. The contribution of this 
paper consists of this technique, described in Sect. 3.4, as well as a detailed eval- 
uation of all three techniques in a case study from practice, described in Sect. 4. 

Against this background, the remainder of this paper is structured as follows. 
Section 2 presents related work on remaining useful life prediction. Section 3 
presents similarity-based remaining useful life prediction techniques, including 
the new technique. Section 4 compares the performance of the various techniques
in a case study and Sect. 5 presents the conclusions.

2 Related Work


RUL prediction can be considered a specialized form of survival analysis [10].
Essentially, two types of techniques exist for predicting RUL: model-based and
data-driven techniques. Model-based techniques use physical models to accurately
represent the wear and tear of a component over time [5]. Data-driven
techniques do not presume any knowledge about how a component wears out
over time, but merely predict the RUL based on past observations. Hybrid models,
which are a combination of physical and data-driven techniques, also exist
[9]. This paper focuses on data-driven models, which are most suited when the
physical mechanisms that cause a component to fail are too complex to model
cost-effectively, or if they are not sufficiently understood.

A large number of data-driven techniques are available. They fall into two classes,
depending on whether a probability distribution of the RUL must be
obtained or a point estimate is sufficient [1]. A probability distribution of the
RUL has several benefits [16, 17, 20]. For example, it facilitates stochastic decision 
making, where maintenance is done when the probability that a part will fail 
exceeds a certain threshold, which is in line with the way in which maintenance 
decisions are made. When it is not necessary to produce a probability density 
function, several models can be used. The most obvious choices include regression 
models that use time as the primary independent variable and time-series models. 
However, regression models require that the behavior of the curve is predictable 
over time [4,13] and time-series [12] models are only suitable for short-term 
predictions [3,16] or when the behavior of the curve is predictable over time. 
Regression models that take other variables into account can also be used [6]. 
Such models have the benefit that they do not only consider the dependency 
of the RUL on the time that the part has been in operation, but also on other 
relevant factors, such as the operational temperature or vibration of the part. 

When the RUL depends on other factors beyond time, but data on such 
factors is not available, one can include them as a black box. While we may 
not know the values of relevant factors, we can still find historical runs that are 
similar to the current run. If we assume that the factors that influenced histor- 
ically similar runs are also similar to the current run, then the future behavior 
of the current run will also be similar to the behavior of the historically similar 
runs. This is called Trajectory Based Similarity Prediction (TBSP) [11,18,19]. 
Bayesian updating techniques use a similar principle [7, 8]. Such techniques create
a prior probability distribution of the RUL (based on data from historical
runs to failure), which is updated as more data of the current run is revealed.


3 Prediction Techniques 


This section presents similarity-based techniques that can be used for RUL pre- 
diction: TBSP and Bayesian updating, which are defined in related work as 
explained in Sect. 2. Subsequently, Sect. 3.4 presents a novel technique, Bayesian 
updating with similarity-based prior estimation, which is a combination of TBSP 
and Bayesian updating. 
3.1 Preliminaries

The remaining useful life of a part is defined as follows. 


Definition 1 (Remaining Useful Life (RUL)). Let $t$ be a moment in a run 
and $t_E$ be the moment in the run at which the part fails. The Remaining Useful 
Life (RUL) at time $t$, $r(t)$, is defined as $r(t) = t_E - t$. 


Note that ‘failure’ can be interpreted broadly. It does not have to be the 
point at which the part breaks, but can also be the point at which the part 
reaches a condition in which it is not considered suitable for operation anymore, 
or a condition in which maintenance is considered necessary. Over time, multiple 
runs to failure will be observed, such as the runs to failure shown in Fig. 1. 


Definition 2 (Run to failure library). $L$ is the library of past runs to failure. 
For each $l \in L$, $t_E^l$ is the moment in the run at which the part fails, and 
$g^l(t)$ is the function that returns the condition of the part at time $t$ of the run. 


The function $g^l(t)$ is created by fitting a curve to the condition measurements 
of the run. We consider the one-dimensional case here, in which we only measure 
the condition of the part. This can easily be extended to a multi-dimensional 
case, in which we also measure external factors other than the condition 
variable itself (such as the operating temperature or pressure), by 
considering the observations as vectors over multiple variables. We will also omit 
the superscript $l$ if there can be no confusion about the run to which we refer. 


3.2 Trajectory-Based Similarity Prediction 


Figure 2 shows a different (cf. Fig. 1) representation of a run to failure 
library. It shows all runs in the library, starting from the moment at which the 
condition variable starts to increase from the base condition. It also shows a 
‘current’ run as a thicker, unfinished curve. The idea of trajectory-based 
similarity prediction is to find some number k of runs that are most similar to 
the current run. For each of these k similar runs, we know the time it took until 
the part failed. Trajectory-based Similarity Prediction (TBSP) estimates the time 
until failure as the mean failure time of the similar runs. 


Fig. 2. Example library of runs. (Axes: Lifetime (days) vs. Differential Pressure (bar).) 


Definition 3 (Distance of current run to library run). At a moment in 
time $t$, let $I$ be the number of observations made in the current run, with values 
$z_1, \ldots, z_I$ observed at times $t_1, \ldots, t_I$, and let $l \in L$ be a library run. 
We denote by $d^l(t)$ any distance measure contrasting $z_1, \ldots, z_I$ with 
$g^l(t_1), \ldots, g^l(t_I)$. Let $E^l(t)$ and $M^l(t)$ denote Euclidean and Manhattan 
distance, respectively. 


Clearly other distance functions can and indeed have been used as well in the 
context of remaining useful life prediction [21]. An in-depth analysis of the dis- 
tance function that performs best for TBSP is beyond the scope of this work. 


Definition 4 (Fit of current run to library run). For each library run 
$l \in L$, let $d^l(t)$ be defined as in Definition 3. The fit of the current run to $l$ is: 


$$S^l(t) = e^{-d^l(t)}$$ 


When, at time $t$ of the current run, the library run $l$ is found that fits the 
current run best, the remaining useful life of the current run can be predicted 
as the remaining useful life of that run $l$: $r(t) = t_E^l - t$. It is also possible 
to base the prediction of the remaining useful life on the best k runs; sensitivity 
to k is part of our experiments. If k > 1, we can also aggregate RUL predictions by 
weighted average, where the weights are the goodness of fit of the library runs 
to the current run. 


Definition 5 (Trajectory-based Similarity Prediction). For each library 
run $l \in L$, let $S^l(t)$ be the fit of the run to the current run as per Definition 4 
and let $r^l(t)$ be the RUL of the run. Let $L' \subseteq L$ be the subset of past runs on 
which we want to base our RUL prediction. The predicted RUL of the current 
run, $\hat{r}(t)$, is: 


$$\hat{r}(t) = \frac{\sum_{l \in L'} S^l(t)\, r^l(t)}{\sum_{l \in L'} S^l(t)}$$ 
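
The prediction step of Definitions 3 to 5 can be sketched as follows. This is a minimal illustration, not the implementation used in the case study: the library layout (pairs of a fitted curve and a failure time), the function names, and the exponential fit score `exp(-d)` are assumptions made for the example.

```python
import math


def euclidean(zs, gs):
    """Euclidean distance between observed values and library-curve values."""
    return math.sqrt(sum((z - g) ** 2 for z, g in zip(zs, gs)))


def manhattan(zs, gs):
    """Manhattan distance between observed values and library-curve values."""
    return sum(abs(z - g) for z, g in zip(zs, gs))


def tbsp_predict(times, values, library, k=3, distance=euclidean):
    """Predict the RUL of the current run from its k best-fitting library runs.

    times, values: observation times and condition values of the current run.
    library: list of (g, t_fail) pairs, where g is the fitted curve of a
             library run (a callable) and t_fail its failure time
             (hypothetical layout chosen for this sketch).
    """
    t_now = times[-1]
    scored = []
    for g, t_fail in library:
        d = distance(values, [g(t) for t in times])
        fit = math.exp(-d)  # assumed form of the fit score
        scored.append((fit, t_fail - t_now))
    best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(fit for fit, _ in best)
    if total == 0.0:  # every fit underflowed: fall back to an unweighted mean
        return sum(rul for _, rul in best) / len(best)
    return sum(fit * rul for fit, rul in best) / total


# Two hypothetical library runs; the current run follows the first one exactly.
library = [
    (lambda t: 0.2 * math.exp(0.02 * t), 120.0),
    (lambda t: 0.2 * math.exp(0.03 * t), 80.0),
]
times = [0.0, 10.0, 20.0]
values = [0.2 * math.exp(0.02 * t) for t in times]
rul_best = tbsp_predict(times, values, library, k=1)
rul_top2 = tbsp_predict(times, values, library, k=2)
```

With k = 1 the prediction equals the remaining life of the single best-fitting run; with k > 1 it is pulled towards the other selected runs in proportion to their fit.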


3.3 Bayesian Updating 


A Bayesian updating method has also been proposed to create a probability dis- 
tribution of the remaining useful life [7,8]. The probability distribution can be 
updated with each observation of the condition variable that is obtained. The 
method works by fitting an exponential model to the library runs and subse- 
quently updating that model with observations of the current run. 

Intuitively, looking at Fig. 2, Bayesian updating works by fitting a curve to 
each of the library runs or to a selection of library runs. Based on the resulting 
collection of curves, a prior probability distribution of the time until the part fails 
can be created, which represents the ‘probable’ curve that the current run —or 
in fact any run—will follow. The prior probability distribution can be updated 
each time a condition value is observed in the current run. This update leads to 
a posterior probability distribution that represents the curve that the current 
run will follow with a higher precision (smaller confidence interval). 


Definition 6 (RUL probability density). For each library run $l \in L$, let 
$g^l(t)$ be the function that returns the condition of the part at time $t$ of the run. 
The condition function can be fitted as an exponential model that has the form: 


$$g^l(t) = \phi + \theta e^{\beta t + \epsilon(t) - \frac{\sigma^2}{2}}$$ 


Here, $\phi$ is the intercept, $\epsilon(t)$ is the error term with mean 0 and variance 
$\sigma^2$, and $\theta$ and $\beta$ are random variables. 


If we set $\phi = 0$ and take the natural logarithm of both sides, we get: 


$$\ln(g^l(t)) = \theta' + \beta t + \epsilon(t)$$ 


where $\theta' = \ln(\theta) - \frac{\sigma^2}{2}$. Considering that we have multiple runs 
$l \in L$, it is possible to fit this equation multiple times to those runs and 
calculate values for $\theta'$, $\beta$ and $\sigma$ for each run. With these values, we can 
compute the prior probability distributions of $\theta'$ and $\beta$. We assume these 
distributions are normal distributions with means $\mu_0$ and $\mu_1$ and variances 
$\sigma_0^2$ and $\sigma_1^2$. While the prior distributions are created based on 
observations from library runs, the distributions can be updated 
as more observations become available in the current run. 
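
As an illustration of this step, the sketch below fits the log-linear model to each library run by ordinary least squares and summarizes the fitted parameters as normal priors. The function names, the use of population variance, and the averaging of the per-run error variances are assumptions of this example; the text does not prescribe them.

```python
import math
from statistics import mean, pvariance


def fit_log_linear(times, values, phi=0.0):
    """Least-squares fit of ln(g(t) - phi) = theta' + beta * t.

    Returns (theta', beta, sigma^2), where sigma^2 is the residual variance.
    Requires at least two distinct times and values strictly above phi.
    """
    logs = [math.log(v - phi) for v in values]
    t_bar, l_bar = mean(times), mean(logs)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (l - l_bar) for t, l in zip(times, logs))
    beta = sxy / sxx
    theta_prime = l_bar - beta * t_bar
    residuals = [l - (theta_prime + beta * t) for t, l in zip(times, logs)]
    sigma2 = sum(r * r for r in residuals) / max(len(times) - 2, 1)
    return theta_prime, beta, sigma2


def priors_from_runs(runs):
    """Summarize per-run fits as normal priors for theta' and beta.

    runs: list of (times, values) pairs, one per library run.
    """
    fits = [fit_log_linear(times, values) for times, values in runs]
    thetas = [f[0] for f in fits]
    betas = [f[1] for f in fits]
    return {
        "mu0": mean(thetas), "var0": pvariance(thetas),
        "mu1": mean(betas), "var1": pvariance(betas),
        "sigma2": mean(f[2] for f in fits),  # assumed pooling of error variance
    }


# Two noise-free hypothetical runs with growth rates 0.1 and 0.3.
ts = [0.0, 1.0, 2.0, 3.0]
runs = [
    (ts, [0.2 * math.exp(0.1 * t) for t in ts]),
    (ts, [0.2 * math.exp(0.3 * t) for t in ts]),
]
priors = priors_from_runs(runs)
```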


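Proposition 1 (RUL probability density updating). Let $\pi(\theta')$ and $\pi(\beta)$ 
be the prior distributions of the random variables from Definition 6 with means 
$\mu_0$ and $\mu_1$ and variances $\sigma_0^2$ and $\sigma_1^2$, where 
$\theta' = \ln(\theta) - \frac{\sigma^2}{2}$ and $\sigma^2$ is the variance 
of the error term. Furthermore, let there be $I$ observed values, $z_1, \ldots, z_I$, in the 
current run, made at times $t_1, \ldots, t_I$, and for $i \in I$, let $L_i = \ln(z_i)$ be the 
natural logarithm of each observation. The posterior distribution is a bivariate 
normal distribution of $\theta'$ and $\beta$, whose means $\mu_{\theta'}$ and $\mu_\beta$, variances 
$\sigma_{\theta'}^2$ and $\sigma_\beta^2$, and correlation coefficient $\rho$ can be calculated as follows: 


$$\mu_{\theta'} = \frac{\left(\sum_{i \in I} L_i \sigma_0^2 + \mu_0 \sigma^2\right)\left(\sum_{i \in I} t_i^2 \sigma_1^2 + \sigma^2\right) - \left(\sum_{i \in I} t_i \sigma_0^2\right)\left(\sum_{i \in I} L_i t_i \sigma_1^2 + \mu_1 \sigma^2\right)}{\left(|I| \sigma_0^2 + \sigma^2\right)\left(\sum_{i \in I} t_i^2 \sigma_1^2 + \sigma^2\right) - \left(\sum_{i \in I} t_i \sigma_0^2\right)\left(\sum_{i \in I} t_i \sigma_1^2\right)}$$ 

$$\mu_\beta = \frac{\left(|I| \sigma_0^2 + \sigma^2\right)\left(\sum_{i \in I} L_i t_i \sigma_1^2 + \mu_1 \sigma^2\right) - \left(\sum_{i \in I} t_i \sigma_1^2\right)\left(\sum_{i \in I} L_i \sigma_0^2 + \mu_0 \sigma^2\right)}{\left(|I| \sigma_0^2 + \sigma^2\right)\left(\sum_{i \in I} t_i^2 \sigma_1^2 + \sigma^2\right) - \left(\sum_{i \in I} t_i \sigma_0^2\right)\left(\sum_{i \in I} t_i \sigma_1^2\right)}$$ 

$$\sigma_{\theta'}^2 = \sigma^2 \sigma_0^2 \, \frac{\sum_{i \in I} t_i^2 \sigma_1^2 + \sigma^2}{\left(|I| \sigma_0^2 + \sigma^2\right)\left(\sum_{i \in I} t_i^2 \sigma_1^2 + \sigma^2\right) - \left(\sum_{i \in I} t_i\right)^2 \sigma_0^2 \sigma_1^2}$$ 

$$\sigma_\beta^2 = \sigma^2 \sigma_1^2 \, \frac{|I| \sigma_0^2 + \sigma^2}{\left(|I| \sigma_0^2 + \sigma^2\right)\left(\sum_{i \in I} t_i^2 \sigma_1^2 + \sigma^2\right) - \left(\sum_{i \in I} t_i\right)^2 \sigma_0^2 \sigma_1^2}$$ 

$$\rho = \frac{-\sigma_0 \sigma_1 \sum_{i \in I} t_i}{\sqrt{\left(|I| \sigma_0^2 + \sigma^2\right)\left(\sum_{i \in I} t_i^2 \sigma_1^2 + \sigma^2\right)}}$$ 


The proof of this proposition is given in [8]. Consequently, $\ln(g^l(t))$ for the 
current run to failure $l$ is normally distributed with mean and variance: 


$$\mu(t) = \mu_{\theta'} + \mu_\beta t - \frac{\sigma^2}{2}, \qquad \sigma^2(t) = \sigma_{\theta'}^2 + \sigma_\beta^2 t^2 + \sigma^2 + 2 \rho t \sigma_{\theta'} \sigma_\beta$$ 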


With this information, the probability that future values of $\ln(g^l(t))$ exceed the 
maximum acceptable condition at some time $t$ can be computed. 
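
The update and the resulting exceedance probability can be sketched in a few lines. The code below is an illustration under the stated model (independent normal priors on the intercept and slope of the log-linear model, known error variance); the function names and the example numbers are hypothetical.

```python
import math


def posterior_parameters(times, logs, mu0, var0, mu1, var1, sigma2):
    """Closed-form posterior of (theta', beta) given logged observations.

    Returns (mu_theta, mu_beta, var_theta, var_beta, rho).
    """
    n = len(times)
    st = sum(times)
    stt = sum(t * t for t in times)
    sl = sum(logs)
    slt = sum(l * t for l, t in zip(logs, times))
    a = n * var0 + sigma2
    b = stt * var1 + sigma2
    denom = a * b - (st * var0) * (st * var1)
    mu_theta = ((sl * var0 + mu0 * sigma2) * b
                - (st * var0) * (slt * var1 + mu1 * sigma2)) / denom
    mu_beta = (a * (slt * var1 + mu1 * sigma2)
               - (st * var1) * (sl * var0 + mu0 * sigma2)) / denom
    var_theta = sigma2 * var0 * b / denom
    var_beta = sigma2 * var1 * a / denom
    rho = -math.sqrt(var0 * var1) * st / math.sqrt(a * b)
    return mu_theta, mu_beta, var_theta, var_beta, rho


def exceedance_probability(t, threshold, post, sigma2):
    """P(ln g(t) > ln threshold) under the normal predictive distribution."""
    mu_theta, mu_beta, var_theta, var_beta, rho = post
    mean = mu_theta + mu_beta * t - sigma2 / 2.0
    var = (var_theta + var_beta * t * t + sigma2
           + 2.0 * rho * t * math.sqrt(var_theta * var_beta))
    z = (math.log(threshold) - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the normal


# Hypothetical numbers: three logged observations and weak priors.
post = posterior_parameters([1.0, 2.0, 3.0], [0.1, 0.25, 0.35],
                            mu0=0.0, var0=1.0, mu1=0.1, var1=0.04, sigma2=0.01)
p_fail = exceedance_probability(t=50.0, threshold=2.4, post=post, sigma2=0.01)
```

Evaluating `exceedance_probability` over a grid of future times gives a distribution over the time at which the threshold is crossed, from which an RUL estimate can be read off.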


3.4 Bayesian Updating with Similarity-Based Prior Estimation 


The RUL probability density function in Definition 6 depends on estimated prior 
distributions of $\theta'$ and $\beta$. These priors can be set through analyzing previous 
runs to failure, either based on the complete library of runs, or on a subset of the 
runs. More precisely, we can determine prior distributions as follows. 


Definition 7 (Prior distributions). For each library run $l \in L$, let $g^l(t)$ be 
the exponential curve that is fitted to the observations in that run with parameters 
$\theta'^{\,l}$ and $\beta^l$ as in Definition 6. For a subset $M \subseteq L$ of runs, we can 
determine the mean and standard deviation of $\theta'$ and $\beta$ over all $\theta'^{\,m}$ 
and $\beta^m$ with $m \in M$. 


Consequently, our priors depend on the subset $M \subseteq L$ of runs that we use. 
For example, we can determine our priors based on M = L, the complete set of 
runs. Here, we consider a variant of the Bayesian updating method in which the 
priors are set based on the runs that are most similar to the current run, using 
Definition 4 for similarity and thresholds to select the most similar runs. More 
precisely, we select our priors as follows. 


Definition 8 (Similarity-based prior distributions). Let $t$ be the moment 
in time at which we determine our prior distributions and $k$ be the number of 
similar runs on which we base them. Furthermore, let $S^l(t)$ be the similarity of 
a run $l$ to the observations in the current run until time $t$ as per Definition 4. 
The set of $k$ most similar runs $M \subseteq L$ at moment $t$ is then defined as the set in 
which, for all runs $m \in M$, there is no run $l \in L \setminus M$ such that $S^l(t) > S^m(t)$. 
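
A minimal sketch of this selection, assuming the fit scores and the fitted parameters of the library runs are already available as dictionaries keyed by a run identifier (a hypothetical layout), could look like this:

```python
from statistics import mean, pstdev


def k_most_similar(fit_by_run, k):
    """Return the set M of the k runs with the largest fit score."""
    ranked = sorted(fit_by_run, key=fit_by_run.get, reverse=True)
    return set(ranked[:k])


def similarity_based_priors(params_by_run, fit_by_run, k):
    """Mean and standard deviation of theta' and beta over the k best runs.

    params_by_run: run id -> (theta', beta) from the per-run curve fits.
    fit_by_run:    run id -> fit score of that run to the current run.
    """
    selected = k_most_similar(fit_by_run, k)
    thetas = [params_by_run[m][0] for m in selected]
    betas = [params_by_run[m][1] for m in selected]
    return (mean(thetas), pstdev(thetas)), (mean(betas), pstdev(betas))


# Hypothetical scores and parameters for four library runs.
fits = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.5}
params = {"a": (-1.6, 0.02), "b": (-1.0, 0.10), "c": (-1.4, 0.04), "d": (-1.2, 0.07)}
selected = k_most_similar(fits, 2)
theta_prior, beta_prior = similarity_based_priors(params, fits, 2)
```

Ties in the fit score are broken by dictionary order here; the definition leaves the choice among tied runs open.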


Note that this definition depends on variables t and k, which can therefore 
be expected to influence the performance of the technique. In our evaluation, we 
will explore the performance of the technique for different values of t and k. 


4 Evaluation 


In this section, we put the RUL prediction techniques introduced in Sect. 3 to 
the test, in a case study with data from practice. 


4.1 Case Study 


Our data originates from a chemical plant on the Chemelot Industrial Site¹. The 
plant we investigate produces a steady flow of various chemical products; 
whatever the product happens to be, an unwanted byproduct is always generated. 
Filters have been installed to obtain an untainted final product. These filters 
have a variable service life, ranging between two and eight days. When the 
filter performs its function, it withholds residue of the unwanted byproduct. This 
residue gradually builds up, forming a cake which increases the resistance of the 
filter. The additional resistance is measured through an increase in differential 
pressure (δP), as illustrated in Fig. 1. An unclogged filter has a δP of 0.2 bar. 
When δP reaches a threshold of 2.4 bar, a valve in front of the filter is switched 
to let the product run through a parallel, clean filter, which returns δP to 0.2 
bar and enables engineers to maintain the clogged filter. 

Sensor data, including δP, is stored in a NoSQL database as time series. 
Preprocessing is needed in several aspects. First, the data has many missing val- 
ues, which we replace by the last observed value. Second, the sensors generate 
a data point every second. We established experimentally that resampling the 
data to the minute barely loses any information from the signal, while still sub- 
stantially reducing the size of the dataset. Third, to avoid the amplification of 
clear outliers, they are removed with a Hampel filter [14]. Fourth, we focus on the 
‘exponential deterioration stage’ of the filter’s life cycle [5], because—according 
to the company—the start of that stage is early enough to be able to act on time, 
and because it provides us with a dataset that is suitable for similarity-based 
RUL prediction techniques. The start and end of the exponential deterioration 
stage must be derived from data. We do that by comparing the average pressure 
over the last hour with its preceding hour. To ensure that every run has only one 
start per stop, a detected start is ignored if another start was already detected 
in the same run. 
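
The first three preprocessing steps can be illustrated as follows. This is a sketch under assumptions: missing values are encoded as `None`, resampling takes the mean of each block of 60 one-second samples, and the Hampel filter replaces a flagged point by the window median (a common variant; the text only says outliers are removed). Window size and threshold are hypothetical.

```python
from statistics import median


def forward_fill(values):
    """Replace each missing value (None) by the last observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # leading gaps stay None: nothing observed yet
    return filled


def resample_mean(values, step=60):
    """Downsample by averaging consecutive blocks of `step` samples."""
    blocks = [values[i:i + step] for i in range(0, len(values), step)]
    return [sum(block) / len(block) for block in blocks]


def hampel(values, half_window=3, n_sigmas=3.0):
    """Replace outliers by the local median (Hampel identifier)."""
    scale = 1.4826  # makes the MAD consistent with a normal standard deviation
    cleaned = list(values)
    for i, v in enumerate(values):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        mad = scale * median(abs(w - med) for w in window)
        if abs(v - med) > n_sigmas * mad:
            cleaned[i] = med
    return cleaned


filled = forward_fill([1.0, None, None, 2.0])
minute = resample_mean([1, 2, 3, 4], step=2)
smooth = hampel([1, 1, 1, 10, 1, 1, 1])
```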


4.2 Results 


We quantify our results using an α-λ graph. Intuitively, this graph represents the 
probability that, at a certain moment in the run to failure (λ), the RUL prediction 
is within a pre-defined level of precision (α) [15]. We will use a concise 
representation of the α-λ quality: rather than time into the run, we put the 
RUL on the x-axis, while the y-axis displays the probability. This representation 
allows us to visually compare different techniques. All analysis is done using 
5-fold cross-validation. The results presented in the graphs are the averages over 
the 5 folds. 
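
One way to compute such a curve is sketched below. It assumes that the precision bound is relative to the actual RUL (a prediction counts as a hit if it deviates by at most α times the actual RUL) and that actual RUL values are grouped into fixed-width bins; both choices, and the names used, are assumptions of this example.

```python
from collections import defaultdict


def within_alpha(predicted, actual, alpha=0.2):
    """True if the prediction lies within alpha * actual of the actual RUL."""
    return abs(predicted - actual) <= alpha * actual


def alpha_curve(records, alpha=0.2, bin_width=24.0):
    """Fraction of predictions within the alpha bound, per bin of actual RUL.

    records: iterable of (actual_rul, predicted_rul) pairs.
    Returns a dict mapping the lower edge of each RUL bin to a probability.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for actual, predicted in records:
        start = (actual // bin_width) * bin_width
        counts[start] += 1
        hits[start] += within_alpha(predicted, actual, alpha)
    return {start: hits[start] / counts[start] for start in sorted(counts)}


# Hypothetical (actual, predicted) RUL pairs in hours.
curve = alpha_curve([(100.0, 110.0), (100.0, 50.0), (30.0, 33.0)])
```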

Figures 3a, b, and c show the performance of the TBSP technique for various 
parameter settings. Figure 3a compares the performance of TBSP when fitting 
various types of curves: second (‘poly2’) and third (‘poly3’) order polynomials, 
exponential curves (‘exp1’), and the sum of two exponential curves (‘exp2’). 
Fig. 3b compares Manhattan and Euclidean distance, and Fig. 3c shows the 
sensitivity to the number of similar curves k. The graphs show that TBSP performs 
best for an exponential curve in short-term (<48 h) predictions, and for k = 2, 3, 
or 4, while there is little to no performance difference between Manhattan and 
Euclidean distance and between k = 2, 3, or 4. For those reasons, we 
parameterize TBSP with exponential curves, using Euclidean distance as a distance 
metric, and using 3 similar curves to make the prediction. 


¹ An anonymized version of the data is made available at: 
https://surfdrive.surf.nl/files/index.php/s/1dTFF X{Z7woeSUA. 


Fig. 3. Comparison of hyperparameter settings. 

Figure 3d shows the performance of the Bayesian updating technique for 
various sets of runs on which the prior is based. We consider four alternatives. 
In the first alternative, no prior is defined and the prediction is only computed 
based on the current run. In the second alternative, the prior distribution is 
based on all runs in the library. In the third alternative, we create a prior 
distribution by fitting the run with the (closest to) average run to failure time. 
In the fourth alternative, we create a prior distribution by fitting the shortest, 
the longest, and the average run. The figure shows that for long-term predictions, 
a prior fitted on the ‘average’, the shortest and the longest run performs best, 
while for short-term predictions, a prior fitted on the whole library performs best. 


Fig. 4. Performance across moments for setting priors. (Panel title: Probability 
within a bound, α = 20%; y-axis: Probability.) 


Figure 4 shows the performance of Bayesian updating with similarity-based 
priors for various settings of the moment at which the priors are determined. The 
best performance is obtained when priors are determined 5 h into the current 
run; 10, 15, and 20 h were also considered. The number of similar runs 
on which the priors are based is also a parameter for Bayesian updating with 
similarity-based priors. The priors are based on the 3 most similar runs. This 
led to the best results when comparing results for priors based on 1, 2, 3, 4, 5, 
and 10 similar runs. 

Figure 5 shows the results for the various prediction techniques: TBSP, 
Bayesian updating, and Bayesian updating with similarity-based priors. The 
results show a clear distinction in the performance of the different techniques. 
TBSP performs best for short-term (<48 h before failure) predictions, while 
Bayesian updating with similarity-based priors performs best in the long term 
(150-200 h before failure). This is expected, because for long-term prediction, 
Bayesian updating with similarity-based priors benefits from being based both 
on similar runs and on general Bayesian behavior, while after some updates 
the impact of the priors is reduced and the behavior approaches that of 
normal Bayesian updating. TBSP on the other hand benefits from having a better 
estimate of the runs to which it is close as time progresses. 


Fig. 5. Overall comparison of techniques. 


5 Conclusions 


In a case study, we show how techniques from literature can be combined 
and parameterized to accurately predict the Remaining Useful Life (RUL) of 
a machine or part. While curves of the degradation of a machine or part over 
time typically have a similar shape, the challenge is that operational constraints, 
which may be unknown, influence the exact parameterization of that curve, as 
evidenced by the real-life runs displayed in Figs. 1 and 2. Therefore, we propose 
a similarity-based prediction technique: while it makes no sense to compare the 
current run with all previously observed runs, it is quite likely that there are 
some historical runs that are similar to the current run, because they have similar 
operational constraints, hence providing us with powerful predictive information. 

This paper proposes a new similarity-based prediction technique, in which we 
obtain a probability distribution of the RUL through Bayesian updating, where 
the priors of the Bayesian distribution are calculated based on a careful selection 
of previously seen runs. As evidenced by Fig. 5, our technique outperforms 
alternative techniques in a case study by a large margin within the long-term 
region. If we strive to predict the RUL over a shorter horizon, Fig. 5 clearly 
indicates that other methods work better. 

While we studied the performance of RUL prediction techniques in the 
context of a particular case study, in many other domains degradation patterns have 
similar properties. In particular, in many other domains run to failure data has 
a ‘sawtooth’ shape as in Fig. 1, degradation depends on operational conditions 
that are unknown (e.g., because they are not measured), and long-term 
predictions are of interest (e.g., for planning maintenance activities). In such 
situations our technique can also be expected to work well. 


Abstract. A large amount of data accommodated in knowledge graphs 
(KG) is metric. For example, the Wikidata KG contains a plenitude 
of metric facts about geographic entities like cities or celestial objects. 
In this paper, we propose a novel approach that transfers orometric 
(topographic) measures to bounded metric spaces. While these methods 
were originally designed to identify relevant mountain peaks on the 
surface of the earth, we demonstrate a notion to use them for metric 
data sets in general, notably metric sets of items enclosed in knowledge 
graphs. Based on this we present a method for identifying outstanding 
items using the transferred valuation functions isolation and prominence. 
Building on this, we imagine an item recommendation process. 
To demonstrate the relevance of the valuations for such processes, 
we evaluate the usefulness of isolation and prominence empirically in 
a machine learning setting. In particular, we find structurally relevant 
items in the geographic population distributions of Germany and France. 


Keywords: Metric spaces · Orometry · Knowledge graphs · Classification 


1 Introduction 


Knowledge graphs (KG), such as DBpedia [15] or Wikidata [24], are the state 
of the art for storing information and drawing knowledge from it. They represent 
knowledge through graphs and consist essentially of items which are related 
through properties and values. This enables them to fulfill the task of giving 
exact answers to exact questions. However, their ability to present a concise 
overview over collections of items with metric distances is limited. The number 
of such data sets in Wikidata is tremendous, e.g., the set of all cities of the world, 
including their geographic coordinates. Further examples are celestial bodies and 
their trajectories or, more generally, feature spaces of data mining tasks. 

One approach to understanding such metric data is to identify outstanding 
elements, i.e., outstanding items. Based on such elements it is possible to compose 
or enhance item recommendations to users. For example, such recommendations 
could provide a set of the most relevant cities in the world with respect 
to being outstanding in their local surroundings. However, it is a challenging 
task to identify outstanding items in metric data sets. In cases where the metric 
space is equipped with an additional valuation function, this task becomes more 
feasible. Such functions, often called scores or height functions, are often 
naturally provided: cities may be ranked by population; the importance of scientific 
authors by the h-index [12]. A naive approach for recommending relevant items 
in such settings would be: items with higher scores are more relevant items. 
While this method seems reasonable for many applications, some obstacles arise 
if the “highest” items concentrate in a specific region of the underlying metric 
space. For example, representing the cities of the world by the twenty most 
populated ones would include no western European city.¹ Recommending the 100 
highest mountains would not lead to knowledge about the mountains outside of Asia.² 


Fig. 1. Isolation: minimal horizontal distance to another point of at least equal height. 
Prominence: minimal vertical descent to reach a point of at least equal height. 

Our novel approach shall overcome this problem: we combine the valuation 
measure (e.g., “height”) and distances to provide new valuation functions on the 
set of items, called prominence and isolation. These functions rate items based 
on their height in relation to the valuations of the surrounding items. This results 
in valuation functions on the set of items that reflect the extent to which an item 
is locally outstanding. The basic idea is the following: the prominence values an 
item based on the minimal descent (w.r.t. the height function) that is needed 
to get to another point of at least the same height. The isolation, sometimes also 
called dominance radius, values the distance to the next higher point w.r.t. the 
metric (Fig. 1). These measures are adapted from the field of topography, where 
isolation and prominence are used in order to identify outstanding mountain 
peaks. We base our approach on [22], where the authors proposed prominence 
and dominance for networks. We generalize these to the realm of bounded metric 
spaces. 

We provide insights into the novel valuation functions and demonstrate their 
ability to identify relevant items for a given topic in metric knowledge graph 
applications. The contributions of this paper are as follows: 

- We propose prominence and isolation for bounded metric spaces. For this we 
  generalize the results in [22] and overcome the limitations to finite, 
  undirected graphs. 
- We demonstrate an artificial machine learning task for evaluating our novel 
  valuation functions in metric data. 
- We introduce an approach for using prominence and isolation to enrich metric 
  data in knowledge graphs. We show empirically that this information helps to 
  identify a set of representative items. 


¹ https://en.wikipedia.org/wiki/List_of_largest_cities on 2019-06-16. 
² https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth on 2019-06-16. 


2 Related Work 


Item recommendation for knowledge graphs is a contemporary topic of high 
interest in research. Investigations cover, for example, music recommendation 
using content and collaborative information [17] or movie recommendations using 
PageRank-like methods [5]. The former is based on the common notion of 
embedding, i.e., embedding of the graph structure into d-dimensional real vector 
spaces. The latter operates on the relational structure itself. Our approach 
differs from those as it is based on combining a valuation measure with the metric 
of the data space. Nonetheless, given an embedding into a finite-dimensional real 
vector space, one could apply isolation and prominence in those as well. 

The novel valuation functions prominence and isolation are inspired by topo- 
graphic measures, which have their origin in the classification of mountain peaks. 
The idea of ranking peaks solely by their absolute height was already deprecated 
in 1978 by Fry in his work [8]. The author introduced prominence for geographic 
mountains, a function still investigated in this realm, e.g., in Torres et al. [23], 
where the authors used deep learning methods to identify prominent mountain 
peaks. Another recent step for this was made in [14], where the authors 
investigated methods for discovering new ultra-prominent mountains. Isolation and 
further valuation functions motivated in the orometric realm are collected in [11]. 
A well-known procedure for identifying peaks and saddles in 3D terrain data 
is described in [6]. However, these approaches rely on data that approximates 
a continuous terrain surface via a regular square grid or a triangulation. Our 
data cannot fulfill this requirement. Recently the idea of transferring orometric 
functions to different realms of research gained attention: The authors of [16] 
used topographic prominence to identify population areas in several U.S. States. 
In [22] the authors Schmidt and Stumme transferred prominence and dominance, 
i.e., isolation, to co-author graphs in order to evaluate their potential of identi- 
fying ACM Fellows. We build on this for proposing our valuation functions on 
bounded metric data. This generalization results in a wide range of applications. 


3 Mathematical Modeling 


While the Wikidata knowledge graph itself could be analyzed with the 
prominence and isolation measures for networks, this paper focuses on bounded metric 
data sets. Analyzing such data sets is more suitable, since real-world networks 
often have a small average shortest path length [26]. This leads to a low 
number of outstanding items: an item is outstanding if it is “higher” than the 
items that have a low distance to it, which is a strict requirement for many 
real-world networks when the shortest path length is used as the metric function. 
Hence, we model our functions for bounded metric data instead of networks. 

We consider the following scenario: We have a data set M, consisting of a set 
of items, in the following called points, equipped with a metric d and a valuation 
function h, in the following called height function. The goal of the orometric 
(topographic) measures prominence and isolation is to provide measures that 
reflect the extent to which a point is locally outstanding in its neighborhood. 

More precisely, let M be a non-empty set and $d : M \times M \to \mathbb{R}_{\geq 0}$. 
We call d a metric on the set M iff 

- $\forall x, y \in M : d(x, y) = 0 \iff x = y$, and 
- $d(x, y) = d(y, x)$ for all $x, y \in M$, called symmetry, and 
- $\forall x, y, z \in M : d(x, z) \leq d(x, y) + d(y, z)$, called triangle inequality. 

If d is a metric on M, we call (M, d) a metric space and if M is finite we call 
(M, d) a finite metric space. If there exists a $C \in \mathbb{R}_{\geq 0}$ such that we 
have $d(m, n) \leq C$ for all $m, n \in M$, we call (M, d) bounded. For the rest of our 
work we assume that |M| > 1 and (M, d) is a bounded metric space. Additionally, we 
have that M is equipped with a height function (valuation/score function) 
$h : M \to \mathbb{R}_{\geq 0}, m \mapsto h(m)$. 


Definition 1 (Isolation). Let (M, d) be a bounded metric space and let 
$h : M \to \mathbb{R}_{\geq 0}$ be a height function on M. The isolation of a point 
$m \in M$ is then defined as follows: 


- If there is no other point with at least equal height to m, then 
  $\mathrm{iso}(m) = \sup\{d(m, n) \mid n \in M\}$. The boundedness of M guarantees 
  the existence of this supremum. 

- If there is at least one other point in M with at least equal height to m, we 
  define its isolation by: 


$$\mathrm{iso}(m) = \inf\{d(m, n) \mid n \in M \setminus \{m\} \wedge h(n) \geq h(m)\}.$$ 
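
For a finite data set the supremum and infimum become a maximum and a minimum, so the definition translates directly into code. The following brute-force sketch assumes hashable, pairwise distinct points and user-supplied callables `d` and `h`; the example data is hypothetical.

```python
def isolation(points, d, h):
    """Isolation of every point of a finite metric space.

    points: list of distinct, hashable points.
    d:      metric, called as d(m, n).
    h:      height function, called as h(m).
    """
    iso = {}
    for m in points:
        # distances to all other points that are at least as high as m
        rivals = [d(m, n) for n in points if n != m and h(n) >= h(m)]
        if rivals:
            iso[m] = min(rivals)
        else:  # m is the unique highest point
            iso[m] = max(d(m, n) for n in points)
    return iso


# Hypothetical example: four points on a line with given heights.
heights = {0: 5, 1: 3, 3: 4, 7: 1}
iso = isolation(list(heights), lambda a, b: abs(a - b), heights.get)
```

In this example the highest point (at position 0) receives the largest isolation, namely its distance to the farthest point.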


The isolation of a mountain peak is often called the dominance radius or 
sometimes the dominance. Since the term orometric dominance of a mountain 
sometimes refers to the quotient of prominence and height, we will stick to the 
term isolation to avoid confusion. While the isolation can be defined within 
the given setup, we have to equip our metric space with some more structure 
in order to transfer the notion of prominence. Informally, the prominence of 
a point is given by the minimal vertical distance one has to descend to get 
to a point of at least the same height. To adapt this measure to our given 
setup in metric spaces with a height function, we have to define what a path 
is. Structures that provide paths in a natural way are graph structures. For 
a given graph $G = (V, E)$ with vertex set $V$ and edge set $E \subseteq \binom{V}{2}$, walks 
are defined as sequences of nodes $\{v_i\}_{i=0}^{n}$ which satisfy $\{v_{i-1}, v_i\} \in E$ 
for all $i \in \{1, \ldots, n\}$. If we also have $v_i \neq v_j$ for $i \neq j$, we call such a 
sequence a path. For $v, w \in V$ we say v and w are connected iff there exists a 
path connecting them. Furthermore, we denote by G(v) the connected component of G 
containing v, i.e., $G(v) = \{w \in V \mid v \text{ is connected with } w\}$. 

To use the prominence measure as introduced by Schmidt and Stumme 
in [22], which is indeed defined on graphs, we have to derive an appropriate 
graph structure from our metric space. The topic of graphs embedded in 
finite-dimensional vector spaces, so-called spatial networks [2], is a topic of current 
interest. These networks appear in real-world scenarios frequently, for example 
in the modeling of urban street networks [13]. Note that our setting, in contrast 
to the aforementioned, is not based on an a priori given graph structure. In our 
scenario the graph structure must be derived from the structure of the given 
metric space. 

Our approach is to construct a step size graph or threshold graph, where we 
consider points in the metric space as nodes and connect two points through an 
edge iff their distance is at most a given threshold δ. 


Definition 2 (δ-Step Graph). Let (M, d) be a metric space and $\delta > 0$. We 
define the δ-step graph or δ-threshold graph, denoted by $G_\delta$, as the tuple 
$(M, E_\delta)$ via 


$$E_\delta := \left\{\{m, n\} \in \tbinom{M}{2} \mid d(m, n) \leq \delta\right\}. \tag{1}$$ 
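
For a finite point set, the step graph can be built by checking all pairs, as in the sketch below. The adjacency-set representation and the function name are choices made for this illustration, and the comparison uses "at most δ".

```python
from itertools import combinations


def step_graph(points, d, delta):
    """Adjacency sets of the delta-step graph of a finite metric space.

    Two distinct points are adjacent iff their distance is at most delta.
    Runs in quadratic time in the number of points.
    """
    adjacency = {m: set() for m in points}
    for m, n in combinations(points, 2):
        if d(m, n) <= delta:
            adjacency[m].add(n)
            adjacency[n].add(m)
    return adjacency


# Hypothetical example: four points on a line, threshold 2.
graph = step_graph([0, 1, 3, 7], lambda a, b: abs(a - b), delta=2)
```

Here the point at position 7 has no neighbor, which is the kind of noise a suitable choice of δ should rule out.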


This approach is similar to the one found in the realm of random geometric 
graphs, where it is common to define random graphs by placing points 
uniformly in the plane and connecting them via edges if their distance is less than 
a given threshold [21]. Since we introduced a possibility to derive a graph that 
just depends on the metric space, we use a slight modification of the definition 
of prominence compared to [22] for networks. 


Definition 3 (Prominence in Networks). Let $G = (V, E)$ be a graph and 
let $h : V \to \mathbb{R}_{\geq 0}$ be a height function. The prominence $\mathrm{prom}_G(v)$ 
of $v \in V$ is defined by 

$$\mathrm{prom}_G(v) = \min\{h(v), \mathrm{mindesc}_G(v)\} \tag{2}$$ 


where $\mathrm{mindesc}_G(v) = \inf\{\max\{h(v) - h(u) \mid u \in p\} \mid p \in P_v\}$. The set 
$P_v$ consists of all paths to vertices w with $h(w) \geq h(v)$, i.e., 
$P_v = \{\{v_i\}_{i=0}^{n} \in P \mid v_0 = v \wedge v_n \neq v \wedge h(v_n) \geq h(v)\}$, 
where P denotes the set of all paths of G. 


Informally, $\mathrm{mindesc}_G(v)$ reflects the minimal descent needed to get to a 
vertex in G which has a height of at least h(v). For this the definition makes use 
of the fact that $\inf \emptyset = \infty$. This case results in $\mathrm{prom}_G(v)$ being the 
height of v. A distinction from the definition in [22] is that we now consider all 
paths and not just shortest paths. This change better reflects the calculation of 
the prominence for mountains. Based on this we transfer the notions above to metric 
spaces. 
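
Because all paths are considered, the minimal descent from v to a target w equals h(v) minus the largest achievable minimum height along any path from v to w. One possible way to compute this, shown here as an assumption-laden sketch rather than the procedure used in the paper, is a widest-path variant of Dijkstra's algorithm on an adjacency-set graph.

```python
import heapq
import itertools
import math


def prominence(adjacency, h, v):
    """Prominence of vertex v in a finite graph with height dictionary h.

    adjacency: dict mapping each vertex to the set of its neighbors.
    best[u] is the largest minimum height over all paths from v to u
    (including both endpoints), found with a max-heap.
    """
    tie = itertools.count()  # avoids comparing vertices on equal priority
    best = {v: h[v]}
    heap = [(-h[v], next(tie), v)]
    while heap:
        neg_level, _, u = heapq.heappop(heap)
        level = -neg_level
        if level < best.get(u, -math.inf):
            continue  # stale heap entry
        for w in adjacency[u]:
            new_level = min(level, h[w])
            if new_level > best.get(w, -math.inf):
                best[w] = new_level
                heapq.heappush(heap, (-new_level, next(tie), w))
    # descent needed to reach each reachable vertex that is at least as high
    descents = [h[v] - best[w] for w in best if w != v and h[w] >= h[v]]
    return min([h[v]] + descents)  # no such vertex: prominence is h(v)


# Hypothetical path graph a - b - c - d - e with heights 5, 2, 4, 1, 6.
h = {"a": 5, "b": 2, "c": 4, "d": 1, "e": 6}
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
             "d": {"c", "e"}, "e": {"d"}}
prom = {v: prominence(adjacency, h, v) for v in h}
```

The highest vertex keeps its full height as prominence, while vertices adjacent to a higher neighbor get prominence zero.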


Definition 4 (δ-Prominence). Let (M, d) be a bounded metric space and 
$h : M \to \mathbb{R}_{\geq 0}$ be a height function. We define the δ-prominence 
$\mathrm{prom}_\delta(m)$ of $m \in M$ as $\mathrm{prom}_{G_\delta}(m)$, i.e., the prominence of m in 
$G_\delta$ from Definition 2. 


We now have a notion of prominence for all metric spaces that depends on a
parameter $\delta$. As in any knowledge procedure, choosing such a parameter
is a demanding task. Hence, we provide in the following a natural choice
for $\delta$. We consider only those values of $\delta$ for which the corresponding
graph $G_\delta$ does not exhibit noise, i.e., for which there is no element without
a neighbor.

Definition 5 (Minimal Threshold). For a bounded metric space $(M, d)$ with
$|M| > 1$ we define the minimal threshold $\delta_M$ of $M$ as


$$\delta_M := \sup\{\inf\{d(m, n) \mid n \in M \setminus \{m\}\} \mid m \in M\}.$$
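
For finite data the minimal threshold is simply the largest nearest-neighbour distance. A minimal sketch follows; the function name `minimal_threshold`, the Euclidean default metric and the toy coordinates are illustrative assumptions.

```python
from math import dist


def minimal_threshold(points, metric=dist):
    """Largest nearest-neighbour distance of a finite point set."""
    if len(points) < 2:
        raise ValueError("the minimal threshold needs at least two points")
    return max(
        min(metric(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    )


# Toy data (hypothetical): one tight pair and two more remote points.
points = [(0.0, 0.0), (3.0, 4.0), (3.0, 5.0), (10.0, 5.0)]
delta_m = minimal_threshold(points)

# At this threshold every point has at least one neighbour.
assert all(
    any(dist(p, q) <= delta_m for j, q in enumerate(points) if j != i)
    for i, p in enumerate(points)
)
```

Here the most remote point is (10, 5), whose nearest neighbour is 7 units away, so the threshold is 7. Any smaller value would leave that point without a neighbour.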


Based on this definition a natural notion of prominence for metric spaces 
(equipped with a height function) emerges via a limit process. 


Lemma 1. Let $M$ be a bounded metric space and $\delta_M$ as in Definition 5. For
$m \in M$ the following descending limit exists:


$$\lim_{\delta \searrow \delta_M} \mathrm{prom}_\delta(m) \qquad (3)$$


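Proof. Fix any $\hat{\delta} > \delta_M$ and consider on the open interval from
$\delta_M$ to $\hat{\delta}$ the function that maps $\delta$ to $\mathrm{prom}_\delta(m)$:
$\mathrm{prom}_{(\cdot)}(m) : \,]\delta_M, \hat{\delta}[\, \to \mathbb{R},\ \delta \mapsto \mathrm{prom}_\delta(m)$.
It is known that it is sufficient to show that $\mathrm{prom}_{(\cdot)}(m)$ is monotone
decreasing and bounded from above. Since for any $\delta$ we have
$\mathrm{prom}_\delta(m) \le h(m)$, it remains to show the monotonicity. Let
$\delta_1, \delta_2$ be in $]\delta_M, \hat{\delta}[$ with $\delta_1 \le \delta_2$. If we
consider the corresponding graphs $(M, E_{\delta_1})$ and $(M, E_{\delta_2})$, it is easy to
see that $E_{\delta_1} \subseteq E_{\delta_2}$. Hence, we have to consider more paths in
Eq. (2) for $E_{\delta_2}$, resulting in a value of the infimum that is not larger.
We obtain $\mathrm{prom}_{\delta_1}(m) \ge \mathrm{prom}_{\delta_2}(m)$, as required.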


Definition 6 (Prominence in Metric Spaces). If $M$ is a bounded metric
space with $|M| > 1$ and a height function $h$, the prominence $\mathrm{prom}(m)$ of
$m$ is defined as:

$$\mathrm{prom}(m) := \lim_{\delta \searrow \delta_M} \mathrm{prom}_\delta(m).$$

Note that if we want to compute prominence on a real-world finite metric data
set, it is possible to compute the prominence values directly: in that case the
supremum in Definition 5 can be replaced by a maximum and the infimum by a
minimum, which leads to $\mathrm{prom}(m)$ being equal to $\mathrm{prom}_{\delta_M}(m)$.
There are results for efficiently creating such step graphs [3]. However, for our
needs in this work, in particular in the experiment section, a quadratic brute force
approach for generating all edges is sufficient. We want to show that our prominence
definition for bounded metric spaces is a natural generalization of Definition 3.


Lemma 2. Let $G = (V, E)$ be a finite, connected graph with $|V| \ge 2$. Consider
$V$ equipped with the shortest path metric as a metric space. Then the prominence
$\mathrm{prom}_G(\cdot)$ from Definition 3 and $\mathrm{prom}(\cdot)$ from Definition 6 coincide.


Proof. Let $M := V$ be equipped with the shortest path metric $d$ on $G$. As $G$
is connected and has more than one node, we have $\delta_M = 1$. Hence, $(M, E_{\delta_M})$
from Definition 2 and $G$ are equal. Therefore, the prominence terms coincide.
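
The argument of Lemma 2 can be checked numerically on a small example. The sketch below, which uses a hypothetical four-vertex graph and helper names of our own choosing, computes the shortest path metric by breadth-first search, evaluates the minimal threshold with maximum and minimum, and confirms that the step graph at that threshold has exactly the original edges.

```python
from collections import deque


def shortest_path_metric(adj):
    """All-pairs hop distances of an unweighted connected graph via BFS."""
    d = {}
    for source in adj:
        hops = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in hops:
                    hops[w] = hops[u] + 1
                    queue.append(w)
        d[source] = hops
    return d


# Hypothetical connected graph: a pendant vertex attached to a triangle.
adj = {1: [2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3]}
d = shortest_path_metric(adj)

# Minimal threshold (Definition 5) for finite data.
delta_m = max(min(d[m][n] for n in adj if n != m) for m in adj)

# Step graph at delta_m (Definition 2) versus the original edge set.
step_edges = {frozenset((m, n)) for m in adj for n in adj
              if m != n and d[m][n] <= delta_m}
graph_edges = {frozenset((m, n)) for m in adj for n in adj[m]}
print(delta_m, step_edges == graph_edges)  # 1 True
```

Since both graphs share vertices and edges, any prominence computation on one gives the same values on the other.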
4 Application


Score Based Item Recommending. As an application we envisage a general
approach for a score based item recommending process. The task of item
recommending with knowledge graphs is a current research topic [17,18]. However,
most approaches are solely based on knowledge about the preferences of the user
and on graph structural properties, often accessed through KG embeddings [19].
The recommendation process we imagine differs from those. We propose a
procedure that is based on the information entailed in the connection
of the metric aspects of the data with some (often naturally present)
height function. We are aware that this limits our approach to metric data in
KGs. Nonetheless, given the large amounts of metric item sets in prominent KGs,
we claim that a plenitude of applications exists. For example, when considering
sets of cities, such a system could recommend a relevant subset based on
a height function, like population, and a metric, like geographical distance. By
doing so, we introduce a source of information for recommending metric data in
relational structures, like KGs.

A common approach for analyzing and learning in KGs is embedding, on which
there is an extensive amount of research, see for example [4,25]. Since our novel
methods rely solely on bounded metric spaces and some valuation function, one
may apply them after the embedding step as well. In particular, one may use
isolation and prominence for investigating or completing KG embeddings. This
constitutes our second envisioned application.

Finally, common item recommending scores/ranks can also be used as height
functions in our sense. Hence, computing prominence and isolation for
recommendation systems that are already set up is another possibility. Here, our
valuation functions have the potential to enrich the recommendation process with
additional information. In this way our measures can provide a novel additional
aspect to existing approaches.

The realization and evaluation of our proposed recommendation approach is out of
scope of this paper. Nonetheless, we want to provide some first insights into the
applicability of valuation functions for item sets based on empirical experiments.
As a first experiment, we evaluate whether isolation and prominence help to
separate important and unimportant items in specific item sets in Wikidata. In
detail, we evaluate whether the valuation functions help to differentiate
important and unimportant municipalities in France and Germany, solely
based on their geographic metric properties and their population as height.


4.1 Resulting Questions 


Let M be a bounded metric space which represents the data set, equipped with a
height function h. The following questions shall evaluate whether our functions
isolation and prominence provide useful information about the relevance of given
points in the metric space. If (M, d, h) is a metric space equipped with an
additional height function, let $c : M \to \{0, 1\}$ be a binary function that
classifies the points in the data set as relevant (1) or not (0). We connect this to
our running example using a function that classifies municipalities as having a
university (1) or not having a university (0). We admit that the underlying
classification is not meaningful in itself. It treats a real geographic case while our
model could also handle more abstract scenarios. However, since this setup is
essentially a benchmark framework (in which we assume cities with universities to
be more relevant), we refrain from employing a more meaningful classification task
in favor of a controllable classification scenario. Our research questions are now:

1. Are prominence and isolation alone characteristic for relevance? We use
isolation and/or prominence for a given set of data points as features. To what
extent do these features improve learning a classification function for relevance?

2. Do prominence and isolation provide additional information not captured by
the absolute height? Do prominence and isolation improve the prediction
performance of relevance compared to just using the height? Does a classifier that
uses prominence and isolation as additional features produce better results than a
classifier that just uses the height?

We will evaluate the proposed setup in the realm of a KG, take on the questions
stated above in the following section, and present some experimental evidence.


5 Experiments 


We extract information about municipalities in the countries of Germany and
France from the Wikidata KG. This KG is a structure that stores knowledge
via statements, linking entities via properties to values. A detailed description
can be found in [24], while [9] gives an explicit mathematical structure to the
Wikidata graph and shows how to use the graph for extracting implicational
knowledge from Wikidata subsets. We investigate if prominence and isolation of
a given municipality can be used as features to predict university locations in a
classification setup. We use the query service of Wikidata³ to extract points in
the country maps of Germany and France and to extract all their universities.
We report all necessary SPARQL queries employed on GitHub.⁴


– Wikidata provides different relations for extracting items that are instances
of the notion city. The obvious choice is to employ the instance of (P31)
property for the item city (Q515). Using this, including subclass of (P279),
we find insufficient results. More specifically, we find only 102 French cities and
2215 German cities.⁵ For Germany, there exists a more commonly used item
urban municipality of Germany (Q42744322) for extracting all cities, while
to the best of our knowledge, a counterpart for France is not provided.

– The preliminary investigation leads us to use municipality (Q15284), again
including the subclass of (P279) property, restricted to municipalities with more
than 5000 inhabitants.

– Since there are multiple French municipalities that are not located in the
mainland of France, we encounter problems when constructing the metric space.
To cope with that, we draw a basic approximating square around the mainland
of France and consider only those municipalities inside.


– We determine the class of every municipality, i.e., university location or
non-university location, as follows. We use the properties located in the
administrative territorial entity (P131) and headquarters location (P159) on the
set of all universities and check if these are set in Germany or France. An
example of a university that has not set P131 is TU Dortmund (Q685557).⁶

– We match the municipalities with the university properties. This is necessary
because some universities are not related to municipalities through P131, e.g.,
Hochschule Niederrhein (Q1318081) is located in the administrative location
North Rhine-Westphalia (Q1198) (see footnote 6), which is a federal state
containing multiple municipalities. For these cases we check the university
locations manually. This results in 2064 municipalities (89 university locations)
in France and 2986 municipalities (160 university locations) in Germany.

– While constructing the data set we encounter twenty-two universities that are
associated with a country but have neither located in the administrative
territorial entity (P131) nor headquarters location (P159). We check them manually
and are able to discard them all for different reasons.


3 https://query.wikidata.org/.
4 https://github.com/mstubbemann/Orometric-Methods-in-Bounded-Metric-Data.
5 Queried on 2019-08-07.


5.1 Binary Classification Task 


Setup. We compute prominence and isolation for all data points and normalize
them as well as the height. The data that is used for the classification task
consists of the following information for each city: the height, the prominence,
the isolation and the binary information whether the city has a university. Since
our data set is highly imbalanced, common classifiers tend to simply predict the
majority class. To overcome the imbalance, we use inverse penalty weights with
respect to the class distribution. We want to stress again that the goal of the
classification task introduced here is not to identify the best classifier. Rather,
we want to produce evidence for the applicability of isolation and
prominence as features for learning a classification function. We decide to use
logistic regression with $L^2$ regularization and Support Vector Machines [7] with
a radial kernel. For our experiment we use Scikit-Learn [20]. As penalty factor for
the SVC we set C = 1, and additionally experiment with
$C \in \{0.5, 1, 2, 5, 10, 100\}$. For $\gamma$ we rely on previous work by [1] and
set it to one. For all combinations of population, isolation and prominence we use
100 iterations of 5-fold cross-validation.
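
The text does not spell out the exact weighting formula. One common reading of "inverse penalty weights with respect to the class distribution" is the heuristic n_samples / (n_classes * class_count), which is also what scikit-learn computes for `class_weight="balanced"`. The sketch below assumes this reading; the function name `inverse_class_weights` is ours, and the class counts are the ones reported for France.

```python
def inverse_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# France: 89 university locations (class 1) among 2064 municipalities.
france = inverse_class_weights({1: 89, 0: 2064 - 89})
print(france)
```

With these weights both classes contribute the same total weight to the loss, so misclassifying one of the rare university locations costs far more than misclassifying a municipality without a university.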


Evaluation. We use the g-mean (i.e., geometric mean) as evaluation function.
For this, consider the notations TN (True Negative), FP (False Positive), FN (False
Negative), and TP (True Positive). Overall accuracy is highly misleading for
heavily imbalanced data. Therefore, we evaluate the classification decisions by
using the geometric mean of the accuracy on the positive instances,
$\mathrm{acc}_+ := \frac{TP}{TP + FN}$, and the accuracy on the negative instances,
$\mathrm{acc}_- := \frac{TN}{TN + FP}$. Hence, the g-mean score is defined by the
formula $\mathrm{gmean} := \sqrt{\mathrm{acc}_+ \cdot \mathrm{acc}_-}$. The
evaluation function g-mean is established in the topic of imbalanced data mining.
It is mentioned in [10] and used for evaluation in [1]. We compare the values for


6 Last checked on 2019-10-26.

Table 1. Results of the classification task. We do 100 rounds of 5-fold cross-validation
and shuffle the data between the rounds. For all rounds we compute the g-mean value
and then compute the average over the 100 rounds.


Country     | France                            | Germany
Classifier  | SVM             | LR              | SVM             | LR
            | Mean   | Std    | Mean   | Std    | Mean   | Std    | Mean   | Std
------------+--------+--------+--------+--------+--------+--------+--------+-------
iso         | 0.7416 | 0.0059 | 0.7703 | 0.0034 | 0.7463 | 0.0028 | 0.7761 | 0.0035
pro         | 0.4861 | 0.0053 | 0.6362 | 0.0055 | 0.3998 | 0.0068 | 0.5750 | 0.0049
pop         | 0.6940 | 0.0031 | 0.7593 | 0.0086 | 0.5982 | 0.0038 | 0.7134 | 0.0043
iso+pro     | 0.7329 | 0.0067 | 0.7657 | 0.0066 | 0.7320 | 0.0042 | 0.7642 | 0.0041
iso+pop     | 0.7668 | 0.0086 | 0.7812 | 0.0039 | 0.7971 | 0.0041 | 0.8068 | 0.0038
pro+pop     | 0.7011 | 0.0040 | 0.7496 | 0.0051 | 0.6134 | 0.0050 | 0.7108 | 0.0065
iso+pro+pop | 0.7653 | 0.0078 | 0.7778 | 0.0052 | 0.7947 | 0.0042 | 0.8006 | 0.0042


pop = population, pro = prominence, iso = isolation
SVM = Support Vector Machine, LR = Logistic Regression


g-mean for the following cases. First, we train a classifier function purely on 
the features population, prominence or isolation. Secondly, we try combinations 
of them for the training process. We consider the classifier trained using the 
population feature as baseline. An increase in g-mean while using prominence 
or isolation together with the population function is evidence for the utility of 
the introduced valuation functions. Even stronger evidence is a comparison of 
isolation/prominence trained classifiers versus baseline. 

In our experiments, we are not expecting high g-mean values, since the place- 
ment of university locations depends on many additional features, including 
historical evolution of the country and political decisions. Still, the described 
evaluation setup is sufficient to demonstrate the potential of the novel features. 
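
The g-mean used for evaluation follows directly from the four confusion counts. The sketch below is illustrative only: the function name `g_mean` is ours, the confusion counts are invented for the example and are not results from the experiments, and returning zero for an empty class is our own convention.

```python
from math import sqrt


def g_mean(tp, fn, tn, fp):
    """Geometric mean of the accuracies on positive and negative instances."""
    acc_pos = tp / (tp + fn) if tp + fn else 0.0
    acc_neg = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(acc_pos * acc_neg)


# Hypothetical classifier that always predicts the majority class
# on data with 89 positives and 1975 negatives.
majority_only = g_mean(tp=0, fn=89, tn=1975, fp=0)
plain_accuracy = 1975 / 2064

# Hypothetical classifier that trades some negatives for positives.
balanced = g_mean(tp=64, fn=25, tn=1600, fp=375)
print(majority_only, round(plain_accuracy, 3), round(balanced, 3))
```

The majority-only predictor reaches a plain accuracy above 0.95 yet a g-mean of zero, which is why accuracy is misleading on such imbalanced data.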


Results. The results of the computations are depicted in Table 1.

• Isolation is a good indicator for structural relevance. For both countries and
both classifiers, isolation outperforms population.

• Combining absolute height with our valuation functions leads to better results.

• Prominence is not useful as a solo indicator. We draw from our results that
prominence alone is not a useful indicator. Prominence is a very strict valuation
function: recall that we constructed the graphs by using distance margins as
indicators for edges, leading to a dense graph structure in denser parts of the
metric space. Hence, a point in a denser part has many neighbors and thus many
potential paths that may lead to a very low prominence value. From Definition 3
we see that having a higher neighbor always leads to a prominence value of zero.
The distance threshold is about 34 km for Germany and 54 km for France. Thus, a
municipality has non-vanishing prominence if it is the most populated point in a
radius of over 34 km, respectively 54 km. Only 75 municipalities of France have
nonzero prominence, with 40 of them being university locations. Germany has 104
municipalities with positive prominence, with 72 of them being university
locations. Thus, prominence alone as a feature is insufficient for the prediction
of university locations.

• Support vector machine and logistic regression lead to similar results. To the
question whether our valuation functions improve the classification compared
with the population feature, support vector machines and logistic regression
provide the same answer: isolation always outperforms population, and a
combination of all features is always better than using just the plain population
feature.

• Support vector machine penalty parameter. Finally, for our last test we check
the different results for support vector machines using the penalty parameters
$C \in \{0.5, 1, 2, 5, 10, 100\}$. We observe that increasing the penalty results in
better performance using the population feature. However, for lower values of C,
i.e., less overfitting models, we see better performance using the isolation
feature. In short, the more the model overfits due to C, the less useful are the
novel valuation functions we introduced in this paper.


6 Conclusion and Outlook 


In this work, we presented a novel approach to identify outstanding elements in 
item sets. For this we employed orometric valuation functions, namely promi- 
nence and isolation. We investigated a computationally reasonable transfer to 
the realm of bounded metric spaces. In particular, we generalized previously 
known results that were researched in the field of finite networks. 

The theoretical work was motivated by the observation that KGs, like Wiki- 
data, do contain huge amounts of metric data. These are often equipped with 
some kind of height functions in a natural way. Based on this we proposed in 
this work the groundwork for a locally working item recommending scheme. 

To evaluate the capabilities for identifying locally outstanding items we 
selected an artificial classification task. We identified all French and German 
municipalities from Wikidata and evaluated if a classifier can learn a meaningful 
connection between our valuation functions and the relevance of a municipal- 
ity. To gain a binary classification task and to have a benchmark, we assumed 
that universities are primarily located at relevant municipalities. In consequence, 
we evaluated if a classifier can use prominence and isolation as features to pre- 
dict university locations. Our results showed that isolation and prominence are 
indeed helpful for identifying relevant items. 

For future work we propose to develop the conceptualized item recommender
system and to investigate its practical usability in an empirical user study.
Furthermore, we encourage research on the transferability of other orometric
valuation functions.


Abstract. While neural networks are powerful approximators used to
classify or embed data into lower dimensional spaces, they are often 
regarded as black boxes with uninterpretable features. Here we pro- 
pose Graph Spectral Regularization for making hidden layers more inter- 
pretable without significantly impacting performance on the primary 
task. Taking inspiration from spatial organization and localization of neu- 
ron activations in biological networks, we use a graph Laplacian penalty 
to structure the activations within a layer. This penalty encourages acti- 
vations to be smooth either on a predetermined graph or on a feature- 
space graph learned from the data via co-activations of a hidden layer 
of the neural network. We show numerous uses for this additional struc- 
ture including cluster indication and visualization in biological and image 
data sets. 


Keywords: Neural Network Interpretability · Graph learning ·
Feature saliency


1 Introduction 


Common intuitions and motivating explanations for the success of deep learning 
approaches rely on analogies between artificial and biological neural networks, 
and the mechanism they use for processing information. However, one aspect 
that is overlooked is the spatial organization of neurons in the brain. Indeed, 
the hierarchical spatial organization of neurons, determined via fMRI and other 
technologies [13,16], is often leveraged in neuroscience works to explore, under- 
stand, and interpret various neural processing mechanisms and high-level brain 
functions. In artificial neural networks (ANN), on the other hand, hidden layers 
offer no organization that can be regarded as equivalent to the biological one. 
This lack of organization poses great difficulties in exploring and interpreting
the internal data representations provided by hidden layers of ANNs and the
information encoded by them. This challenge, in turn, gives rise to the com- 
mon treatment of ANNs as black boxes whose operation and data processing 
mechanisms cannot be easily understood. To address this issue, we focus on the 
problem of modifying ANNs to learn more interpretable feature spaces without 
degrading their primary task performance. 

While most neural networks are treated as black boxes, we note that there are 
methods in ANN literature for understanding the activations of filters in convo- 
lutional neural networks (CNNs) [11], either by examining trained networks [24], 
or by learning a better representation [12,17,18,22,25], but such methods rarely 
apply to other types of networks, in particular dense neural networks (DNNs) 
where a single activation is often not interpretable on its own. Furthermore,
convolutions only apply to data types where we know the feature structure a priori,
as in the case of images and natural language. In layers of a DNN, there is no
enforced structure between neurons. The correspondence between neurons and 
concepts is only determined based on the random initialization of the network. 
In this work, we encourage structure between neurons in the same layer, creating 
more localized and interpretable layers in dense architectures. 

More specifically we propose a Graph Spectral Regularization to encourage 
arbitrary graph structure between neurons within a layer. The internal layers of 
a neural network are constrained to take the structure of a graph, with graph 
neighbors activating on similar inputs. This allows us to map the activations of 
a given layer over the graph and interpret new input by examining the activa- 
tions. We show that graph-structuring a hidden layer causes useful, interpretable 
features to emerge. For instance, we show that grid-structuring a layer of a clas- 
sification network creates a structure over which convolution can be applied, and 
local receptive fields can be traced to understand classification decisions. 

While imposing a known graph structure gives interpretable results in the
majority of cases, there are circumstances where we would like to learn the graph
structure from data. In such cases we can learn and emphasize the natural graph
structure of the feature space. We do this by an iterative process of encoding 
the data, and modifying the graph based on the feature co-activation patterns. 
This procedure reinforces existing patterns in the data. This allows us to learn 
an abstracted graph structure of features in high-dimensional domains such as 
single-cell RNA sequencing. 

The main contributions of this work are as follows: (1) Demonstration of 
hierarchical, spatial, and smoothed feature maps for interpretability in dense 
networks. (2) A novel method for learning and reinforcing the natural graph 
structure for complex feature spaces. (3) Demonstration of graph learning and 
abstraction on single-cell RNA-sequencing data. 


2 Related Work 


Disentangled Representation Learning: While there is no precise definition of
what makes for a disentangled representation, the aim is to learn a representation
whose axes align with the generative factors of the data [2,8]. [9] suggest a
way to disentangle the representation of variational autoencoders [10] with
β-VAE. Subsequent work has generalized this to discrete representations [5] and
simple hierarchical representations [6]. These works focus on learning a single
vector representation of the data, where each element represents a single
concept. In contrast, our work learns a representation where groups of neurons may
be involved in representing a single concept. Moreover, disentangled
representation learning can only be applied to unsupervised models and only to the
most compressed level of either an autoencoder [9] or a generative adversarial
network as in [4], whereas graph spectral regularization (GSR) can be applied to
any or all layers of the network.


Graph Structure in ANNs: Graph based penalties have been used in the graph 
signal processing literature [3,21,26], but are rarely used in an ANN setting. In 
the biological data setting, [14] used a graph penalty in sparse logistic regression 
on gene expression data. Another way of utilizing graph structure is through 
graph convolutional networks (GCN). GCNs are a related body of work intro- 
duced by [7], and expanded on by [19], but focus on a different set of problems 
(For an overview see [23]). GCNs require a known graph structure. We focus on 
learning a graph representation of general data. This learned graph representa- 
tion could be used as the input to a GCN similar to our MNIST example. 


3 Enforcing Graph Structure 


We consider the intra-layer relationships between neurons or larger structures
such as capsules. For a given layer of neurons we construct a graph $G = (V, E)$
with $V = \{v_1, \ldots, v_N\}$ the set of vertices and $E \subseteq V \times V$ the set
of edges. Let $W$ be the weighted symmetric adjacency matrix of size $N \times N$ with
$W_{ij} = W_{ji} \ge 0$ representing the weight of the edge between $v_i$ and $v_j$.
The graph Laplacian $L$ is then defined as $L = D - W$ where
$D_{ii} = \sum_j W_{ij}$ and $D_{ij} = 0$ for $i \ne j$.

To enforce smoothing we use the Laplacian smoothing loss. For some activation
vector $z$ and fixed Laplacian $L$ we formulate the graph spectral regularization
function $G$ as:


$$G(z, L) = z^T L z = \frac{1}{2} \sum_{i,j} W_{ij} \|z_i - z_j\|^2 \qquad (1)$$


where $\|\cdot\|$ denotes the Frobenius norm. We add it to the reconstruction or
classification loss with a weighting term $\alpha$. This adds an additional objective
that activations should be smooth along the graph defined by $L$. This optimization
procedure applies to any multi-layer model and valid graph Laplacian. We apply
this algorithm to grid and hierarchical graph structures on both autoencoder
and classification dense architectures.
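
To make the penalty concrete, the following sketch builds the Laplacian of a small grid graph and evaluates the quadratic form for a smooth and a non-smooth activation vector. It is a plain-Python illustration rather than the training code: the function names are ours, and the unweighted 4-neighbour connectivity of the grid is an assumption, since the text only speaks of a grid graph.

```python
def grid_adjacency(rows, cols):
    """Unweighted 4-neighbour grid graph as a dense symmetric 0/1 matrix."""
    n = rows * cols
    W = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                     # right neighbour
                W[i][i + 1] = W[i + 1][i] = 1.0
            if r + 1 < rows:                     # lower neighbour
                W[i][i + cols] = W[i + cols][i] = 1.0
    return W


def laplacian(W):
    """L = D - W with D the diagonal matrix of row sums."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def gsr_penalty(z, L):
    """Quadratic form z^T L z for a single activation vector z."""
    n = len(z)
    return sum(z[i] * L[i][j] * z[j] for i in range(n) for j in range(n))


def pairwise_form(z, W):
    """Half the weighted sum of squared differences over all ordered pairs."""
    n = len(z)
    return 0.5 * sum(W[i][j] * (z[i] - z[j]) ** 2
                     for i in range(n) for j in range(n))


W = grid_adjacency(2, 2)
L = laplacian(W)
smooth = gsr_penalty([1.0, 1.0, 1.0, 1.0], L)    # constant activations
rough = gsr_penalty([1.0, 0.0, 0.0, 1.0], L)     # checkerboard activations
print(smooth, rough, pairwise_form([1.0, 0.0, 0.0, 1.0], W))  # 0.0 4.0 4.0
```

Constant activations incur no penalty, while the checkerboard pattern differs across all four grid edges and is penalized accordingly. In training, this value would be scaled by the weighting term and added to the task loss.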
Algorithm 1. Graph Learning

Input: batches $x_i$, model $M$ with latent layer activations $z_i$, regularization weight $\alpha$.
Pre-train $M$ on $x_i$ with $\alpha = 0$
for $i = 1$ to $T$ do
    Create graph Laplacian $L_i$ from activations $z_i$
    for $j = 1$ to $m$ do
        Train $M$ on $x_i$ with $\alpha = w$ and $L = L_i$ with MSE + loss in Eq. 1
    end for
end for


3.1 Learning and Reinforcing an Abstracted Feature-Space Graph 


Instead of enforcing smoothness over a fixed graph, we can learn a feature
graph from the data (see Algorithm 1), using the neural network activations
themselves to bootstrap the process. Note that most graph and kernel-based methods
are applied over the space of observations but not over the space of features. One
reason is that it is even more difficult to define a distance between
features than between observations. To circumvent this problem, we propose
to learn a feature graph in the latent space of a neural network using feature
co-activations as a measure of similarity.

We proceed by creating a graph using feature activation similarity, then 
applying this graph using Laplacian smoothing for a number of iterations. This 
converges to a graph of a latent feature space at the level of granularity of the 
number of dimensions in the corresponding layer. 

Our algorithm for learning the graph consists of two phases. First, a pretraining
phase where the model is learned with no graph regularization. Second, we
alternate between constructing the graph from the similarities of the embedding
layer features and further training the network for reconstruction and smoothness
on the graph. There are many ways to create a graph from the feature ×
datapoint activation matrix. We use an adaptive Gaussian kernel,


$$K(z_i, z_j) = \frac{1}{2} \exp\left(-\frac{\|z_i - z_j\|_2^2}{\sigma_i^2}\right) + \frac{1}{2} \exp\left(-\frac{\|z_i - z_j\|_2^2}{\sigma_j^2}\right)$$


where $\sigma_i$ is the adaptive bandwidth for node $i$, which we set as the distance to
the $k$-th nearest neighbor of feature $i$. An adaptive bandwidth Gaussian kernel is
necessary for general architectures as the scale of the activations is not fixed.
Batch normalization can also be used to limit the activation scale.
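
The kernel construction can be sketched as follows. The sketch assumes the symmetrized form in which the squared Euclidean distance between two feature activation vectors is divided by the squared bandwidth of each endpoint and the two Gaussian terms are averaged. The function name, the row-per-feature layout and the toy activations are assumptions, and features are assumed to be pairwise distinct so that no bandwidth is zero.

```python
from math import dist, exp


def adaptive_gaussian_kernel(features, k):
    """Symmetric affinity matrix between feature activation vectors.

    features[i] holds the activations of feature i across all datapoints.
    sigma[i] is the distance from feature i to its k-th nearest other feature.
    """
    n = len(features)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of features")
    D = [[dist(features[i], features[j]) for j in range(n)] for i in range(n)]
    sigma = [sorted(D[i][j] for j in range(n) if j != i)[k - 1]
             for i in range(n)]
    return [[0.5 * exp(-D[i][j] ** 2 / sigma[i] ** 2)
             + 0.5 * exp(-D[i][j] ** 2 / sigma[j] ** 2)
             for j in range(n)]
            for i in range(n)]


# Hypothetical activations: three features observed on two datapoints.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = adaptive_gaussian_kernel(features, k=1)
for row in K:
    print([round(value, 4) for value in row])
```

Averaging the two terms keeps the matrix symmetric even though each feature has its own bandwidth, so it can be used directly as the weight matrix of the graph Laplacian.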

Since we smooth on the graph and then construct a new graph from the
smoothed signal, the learned graph converges to a steady state in which the mean
squared error acts as a repulsive force that stops the graph from collapsing any
further. We present the results of graph learning on a biological dataset and show
that the learned structure adds interpretability to the activations.


4 Experiments


Through examples, we show that visualizing the activations of data on the reg- 
ularized layer highlights relationships in the data that are not easily visible 
without it. We establish this with two examples on fixed graphs, then move to 
graphs learned from the structure of the data with two examples of hierarchical 
structure and two with progression structure. 


4.1 Fixed Structure 


Enforcing fixed graph structure localizes activations for similar datapoints to a
region of the graph. Here we show that enforcing an 8×8 grid graph on a layer of a
dense MNIST classifier causes receptive fields to form, where each digit occupies
a localized group of neurons on the grid. This can, in principle, be applied to
any neural network layer to group neurons activating to similar features. As
in fMRI data or a convolutional neural network, we can examine the activation
patterns for each localized group of neurons. For a second example, we show the
usefulness of encouraging localized structure on a capsule network architecture
[18], where we are able to create globally consistent structure for better
alignment of features between capsules.


[Figure 1. Columns: No Convolution, Convolution, Segmentation. Rows: No Smoothing,
Graph Spectral Regularization.]


Fig. 1. Shows average activation by digit over an (8x8) 2D grid using graph spectral 
regularization and convolutions following the regularization layer. Next, we segment 
the embedding space by class to localize portions of the embedding associated with 
each class. Notice that the digit 4 here serves as the null case and does not show up 
in the segmentation. Finally, we show the top 10% activation on the embedding of 
some sample images. For two digits (9 and 3) we show a normal input, a correctly 
classified but transitional input, and a misclassified input. The highlighted regions of 
the embedding space correlate with the semantic description of the input. 
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Enforcing Grid Structure on Mnist. Without GSR, activations are unstruc- 
tured and as a result are difficult to interpret, in that it is difficult to visually 
identify even which class a digit comes from based on the activation pattern 
(See Fig. 1). With GSR we can organize the activations making this representa- 
tion more visually distinguishable. Since we can now take this embedding as an 
image, it is possible to use a standard convolutional architecture in subsequent 
layers in order to further filter the encodings. When we add 3 layers of 3x3 2D 
convolutions with 2x2 max pooling we see that representations for each digit 
are compressed into specific areas of the image. This leads to the formation of 
receptive fields over the network pertaining to similar datapoints. Using these 
receptive fields, we can now extract the features responsible for digit classification. 
For example, features that contribute to activation in the top right of our grid 
can be associated with the digit 9. 
The activation patterns on the embedding layer correspond well to a human 
perception of the digit type. The 9 that is misclassified as a 7 both has significant 
activation in the 7 region of the embedding layer and looks visually close to a 
7. We can now interpret the embedding layer as a sort of brain map that relates 
regions of activation to types of inputs. This is not possible in a 
standard neural network, where activations are not spatially organized. 

Fig. 2. (a) shows the regularization structure between capsules. (b-c) show reconstruction 
when one of the 16 dimensions in the DigitCaps representation is tweaked 
by 0.05 ∈ [-0.25, 0.25]. (b) Without GSR, each digit responds differently to perturbation 
of the same dimension. (c) With GSR, a single dimension represents line thickness 
across all digits. 


Enforcing Node Consistency on Capsule Networks. Capsule net- 
works [18] represent the input as a set of vectors where norm denotes activa- 
tion and each component corresponds to some abstract feature. These elements 
are generally unordered. Here we use GSR to order these features consistently 
between digits. We train a capsule net on MNIST with GSR on 16 fully connected 
graphs between the 10 digit capsules. In the standard capsule network, each 
capsule orders features randomly based on initialization. However, with GSR we 
obtain a consistent feature ordering, e.g. node 1 corresponds to line thickness 
across all digits. GSR enforces a more ordered and interpretable encoding where 
localized regions are similarly organized, and the global line thickness feature is 
consistently learned between digits. More generally, GSR can be used to order 
nodes such that features common across capsules appear together. Finally, GSR 
does not degrade performance much, as can be seen by the digit reconstructions 
in Fig. 2. 

In these examples the goal was to enforce a specified structure on unstruc- 
tured features, but next we will examine the case where the goal is to learn the 
structure of the reduced feature space. 


4.2 Learning Graph Structure 


Using the procedure defined in Sect. 3.1, we can learn a graph structure. We first 
show that depending on the data, the learned graph exhibits either cluster or 
trajectory structure. We then show that our framework can learn structures that 
are hierarchical, i.e. subclusters within clusters or trajectories within clusters. 
Hierarchies are a difficult structure for other interpretability methods to learn [6]. 
However, our method naturally captures this by allowing for arbitrary graph 
structure among neurons in a layer. 

Fig. 3. We show the structure of the training data and snapshots of the learned graph 
for (a) three modules and (b) eight modules. (c) shows the mean and 95% CI 
of the number of connected components in the trained graph for over 50 trials. 


Cluster Structure on Generated Data. We structure our n-th dataset to 
have exactly n feature clusters. We generate the data with n clusters by first creating 
2^n data points representing the binary numbers from 0 to 2^n - 1, then adding 
Gaussian noise N(0, 0.1). This creates a dataset with a ground truth number of 
feature clusters. In the n-th dataset the learned graph should have n connected 
components for n independent features. In Fig. 3 (a-b) we can see how this graph 
evolves over time for 3 and 8 modules. Fig. 3 (c) shows that the learned graph recovers 
the correct number of connected components for each ground truth number of 
clusters. 
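
The data construction and the component count used as a check can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function names are hypothetical, N(0, 0.1) is read as a standard deviation of 0.1, and the `threshold` used to decide whether an edge exists is an invented parameter.

```python
import itertools
import random


def make_binary_dataset(n, noise_std=0.1, seed=0):
    """Return 2**n noisy points, one per n-bit binary number.

    Assumption: N(0, 0.1) means Gaussian noise with standard deviation 0.1
    added independently to every coordinate.
    """
    rng = random.Random(seed)
    points = []
    # itertools.product enumerates the binary numbers 0 .. 2**n - 1 in order.
    for bits in itertools.product([0, 1], repeat=n):
        points.append([b + rng.gauss(0.0, noise_std) for b in bits])
    return points


def count_connected_components(adjacency, threshold=0.0):
    """Count components of an undirected graph given as a weight matrix.

    Edges with weight <= threshold are ignored (hypothetical cutoff).
    """
    n = len(adjacency)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        stack = [start]
        seen[start] = True
        while stack:  # depth-first search from the unvisited node
            u = stack.pop()
            for v in range(n):
                if not seen[v] and adjacency[u][v] > threshold:
                    seen[v] = True
                    stack.append(v)
    return components


data = make_binary_dataset(3)
print(len(data), len(data[0]))

# Toy "learned graph" on 4 nodes: nodes 0 and 1 linked, 2 and 3 isolated.
toy_graph = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(count_connected_components(toy_graph))
```

For the n-th dataset, the count returned for the learned graph would be compared against n.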

Fig. 4. (a) Graph structure over training iterations. (b) Feature activations of 
parts of the trajectory. PHATE [15] embedding plots colored by (c) branch number 
and (d) inferred trajectory location, showing the branching structure of the data. 


Trajectory Structure on T Cell Development Data. Next, we test graph 
learning on biological mass cytometry data, which is a high dimensional, single-cell 
protein dataset, measured on differentiating T cells from the thymus [20]. 
The T cells lie along a bifurcating progression where the cells eventually diverge 
into two lineages (CD4+ and CD8+). Here, the structure of the data is a trajec- 
tory (as opposed to a pattern of clusters). We can see in Fig. 4 how the activated 
nodes in the graph embedding layer correspond to locations along the data tra- 
jectory, and importantly, the learned graph is a single connected component. The 
activated nodes (yellow) move from the bottom of the embedding to the top as 
T-cells develop into CD8+ cells. The CD4+ lineage is also CD8- and thus looks 
like a mixture between the CD8+ branch and the naive T cells. The learned 
graph structure here has captured the transitioning structure of the underlying 
data. 

Fig. 5. Graph architecture, PCA plot, and activation heatmaps of a standard autoencoder, 
β-VAE [9], and a graph regularized autoencoder, with ReLU activations normalized to 
[0, 1] for comparison. In the model with graph spectral regularization we are able to clearly 
decipher the hierarchical structure of the data, whereas with the standard autoencoder or the 
β-VAE the structure of the data is not clear. 

Clusters Within Clusters on Generated Data. We demonstrate graph 
spectral regularization on data that is generated with a structure containing 
sub-clusters. Our data contains three large-scale structures, each comprising 
two Gaussian sub-clusters generated in 15 dimensions (see Fig. 5). We use this 
dataset as it has both global and local structure. We demonstrate that our graph 
spectral regularized model is able to pick up on both the global and local structure 
of this dataset where disentangling methods such as β-VAE cannot. We 
use a graph-structure layer with six nodes with three connected node pairs and 
employ the graph spectral regularization. After training, we find that each node 
pair acts as a “super node” that detects each large-scale cluster. Within each 
super node, each of the two nodes encodes one of the two Gaussian substructures. 
Thus, this specific graph topology is able to extract the hierarchical 
topology of the data. 


[Fig. 6, panel (c): Graph Spectral Regularization; horizontal axis: Nodes] 


Fig. 6. Shows correlation between a set of marker genes for specific cell types and 
embedding layer activations. First with the standard autoencoder, then our autoen- 
coder with graph spectral regularization. The left heatmap is biclustered, the right 
heatmap is grouped by connected components in the learned graph. We can see pro- 
gression especially in the largest connected component where features on the right of 
the component correspond to less developed neurons. 


Hierarchical Cluster and Trajectory Structure on Developing Mouse 
Cortex Data. In Fig. 6 we learn a graph on a single-cell RNA-sequencing 
dataset of over 4000 cells and over 8000 genes. The data contains a set of cells 
in the process of developing from neural stem cells to full neurons in the mouse 
brain. While there are many gene modules that contribute to neuronal development, 
some cell states have been well studied, so we use a list of cell type 
marker genes to validate our method. We use 1000 PCA components of the 
data in an autoencoder with a 20-dimensional embedding space. We learn the 
graph using an adaptive-bandwidth Gaussian kernel, with the bandwidth for each 
feature set to the Euclidean distance to the nearest neighboring feature. 
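
One way to read the adaptive-bandwidth kernel described above is sketched below. The text only fixes the bandwidth rule (distance to the nearest neighboring feature); the exact kernel form exp(-(d/sigma)^2), the symmetrization by averaging, the zero diagonal, the `eps` guard, and the function name are assumptions made for illustration. Each "feature" is taken here to be a vector, for example the activations of one embedding node over a batch.

```python
import math


def adaptive_gaussian_affinity(features, eps=1e-12):
    """Affinity matrix between feature vectors with a per-feature bandwidth.

    sigma_i is the Euclidean distance from feature i to its nearest other
    feature. Assumptions: K_ij = exp(-(d_ij / sigma_i)^2), the result is
    symmetrized by averaging K with its transpose, and the diagonal is 0.
    """
    n = len(features)
    dist = [[math.dist(features[i], features[j]) for j in range(n)] for i in range(n)]
    # eps avoids division by zero when two features coincide.
    sigma = [max(min(dist[i][j] for j in range(n) if j != i), eps) for i in range(n)]
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                kernel[i][j] = math.exp(-((dist[i][j] / sigma[i]) ** 2))
    return [[0.5 * (kernel[i][j] + kernel[j][i]) for j in range(n)] for i in range(n)]


# Two tight pairs of features far apart from each other.
features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
affinity = adaptive_gaussian_affinity(features)
for row in affinity:
    print(["%.3g" % a for a in row])
```

With a per-feature bandwidth, features in dense regions get narrow kernels and features in sparse regions get wide ones, so a single global scale does not have to be tuned.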

Our graph learns six components that represent meta features over the gene 
space. We can identify each with a specific type of cell or related types of cells. 
For example, the light green component (cluster 2) represents the very early stage 
neural stem cells as it is highly correlated with increased Aldoc, Pax6 and Sox2 
gene expression. Most interesting to examine is cluster 6, the largest component, 
which represents development into mature neurons. Within this component we 
can see a progression from just after intermediate progenitors on the left (show- 
ing Eomes expression) to more mature neurons with higher expression of Tbr1 
and Sox5. With a standard autoencoder we cannot see the progression structure of 
this dataset. While some of the more global structure is captured, we fail to see 
the data progression from intermediate progenitors to mature neurons. Learning 
a graph allows us to create receptive fields e.g. clusters of neurons that corre- 
spond to specific structures within the data, in this case cell types. Within these 
neighborhoods, we can pick up on the substructure within a single cell type, i.e. 
their developmental trajectory. 


4.3 Computational Cost 


Our method can be used to increase interpretability without much loss in representation 
power. At low levels, GSR can be thought of as rearranging the activations 
so that they become spatially coherent. As with other interpretability methods, 
GSR is not meant to increase representation power, but to create useful representations 
at a low cost in power. Since GSR does not require an information bottleneck 
such as in β-VAE, a GSR layer can be very wide while still being interpretable. In 
comparing loss of representation power, GSR should be compared to other regularization 
methods, namely L1 and L2 penalties (see Table 1). In all three cases we 
can see that a higher penalty reduces the model capacity. GSR affects performance 
in approximately the same way as L1 and L2 regularizations do. To confirm this, 
we ran an MNIST classifier and measured train and test accuracy with 10 replicates. 
Graph spectral regularization adds a bit more overhead than elementwise activation 
penalties. However, the added cost amounts to one matrix-vector 
operation per pass. Empirically, GSR shows a computational cost similar to that of other 
simple regularizations such as L1 and L2. To compare costs, we used a Keras model 
with a TensorFlow backend [1] on an NVIDIA Titan X GPU and a dual Intel(R) Xeon(R) 
CPU E5-2697 v4 @ 2.30 GHz, with batch size 256. During training we observed 
233 milliseconds (ms) per step with no regularization, 266 ms for GSR, and 265 ms 
for L2 penalties. 
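
The "one matrix-vector operation per pass" can be made concrete with a small sketch. It assumes the penalty is the graph Laplacian quadratic form z^T L z on a layer's activation vector z (the penalty itself is defined in Sect. 3, which is not reproduced here), and it uses a 4-neighbor grid graph as an example structure. The function names are hypothetical.

```python
def grid_adjacency(rows, cols):
    """Adjacency matrix of a rows x cols grid graph (4-neighbor links)."""
    n = rows * cols
    adj = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:  # link to the right-hand neighbor
                adj[i][i + 1] = adj[i + 1][i] = 1.0
            if r + 1 < rows:  # link to the neighbor below
                adj[i][i + cols] = adj[i + cols][i] = 1.0
    return adj


def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness_penalty(lap, z):
    """Return z^T L z using one matrix-vector product and one dot product."""
    lz = [sum(l_ij * z_j for l_ij, z_j in zip(row, z)) for row in lap]
    return sum(z_i * lz_i for z_i, lz_i in zip(z, lz))


lap = laplacian(grid_adjacency(2, 2))
print(smoothness_penalty(lap, [1.0, 1.0, 1.0, 1.0]))  # constant activations
print(smoothness_penalty(lap, [1.0, 0.0, 0.0, 0.0]))  # one isolated active node
```

Because z^T L z equals the sum of squared activation differences over the edges of the graph, activations that vary smoothly over the graph are penalized less than isolated spikes.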


Table 1. MNIST classification training and test accuracies for coefficient selected 
using cross validation over regularization weights in [10~’,10~®,..., 107] for various 
regularization methods with standard deviation over 10 replicates. 


Regularization | Training accuracy | Test accuracy | Coefficient 
None | 99.1 ± 0.3 | 97.5 ± 0.3 | N/A 
L1 | 98.9 ± 0.3 | 97.4 ± 0.4 | 10^-4 
L2 | 98.3 ± 0.3 | 98.0 ± 0.2 | 10^-4 
GSR (ours) | 99.3 ± 0.3 | 98.0 ± 0.3 | 107 


5 Conclusion 


We have introduced a novel biologically inspired method for regularizing features 
of the internal layers of dense neural networks to take the shape of a graph. We 
show that coherent features emerge and can be used to interpret the underlying 
structure of the dataset. Furthermore, when the intended graph is not known 
a priori, we have presented a method for learning the graph structure, which 
learns a graph relevant to the data. This regularization framework takes a step 
towards more interpretable neural networks, and has applicability for future 
work seeking to reveal important structure in real-world biological datasets as 
we have demonstrated here. 


Abstract. Graph embedding is a technique which consists in finding a 
new representation for a graph usually by representing the nodes as vec- 
tors in a low-dimensional real space. In this paper, we compare some of 
the best known algorithms proposed over the last few years, according to 
four structural properties of graphs: first-order and second-order proxim- 
ities, isomorphic equivalence and community membership. To study the 
embedding algorithms, we introduce several measures. We show that 
most of the algorithms are able to recover at most one of the properties 
and that some algorithms are more sensitive to the embedding space 
dimension than some others. 


Keywords: Graph embedding - Network properties 


1 Introduction 


Graphs are useful to model complex systems in a broad range of domains. Among 
the approaches designed to study them, graph embedding has attracted a lot of 
interest in the scientific community. It consists in encoding parts of the graph 
(node, edge, substructure) or a whole graph into a low dimensional space while 
preserving structural properties. Because it allows all the range of data mining 
and machine learning techniques that require vectors as input to be applied to 
relational data, it can benefit a lot of applications. 

Several surveys have been recently published [5,6,8,20,21], some of them 
including a comparative study of the performance of the methods to solve spe- 
cific tasks. Among them, Cui et al. [6] propose a typology of network embedding 
methods into three families: matrix factorization, random walk and deep learning 
methods. Following the same typology, Goyal et al. [8] compare state-of-the-art 
methods on a few tasks such as link prediction, graph reconstruction or 
node classification and analyze the robustness of the algorithms with respect 
to hyper-parameters. Recently, Cai et al. [5] extended the typology by adding 
deep learning based methods without random walks but also two other families: 
graph kernel based methods notably helpful to represent the whole graph as a 
low-dimensional vector and generative models which provide a latent space as 
embedding space. For their part, Zhang et al. [21] classify embedding techniques 
into two types, unsupervised and semi-supervised network representation learning, 
and they list a number of embedding methods depending on the information 
sources they use to learn. Like Goyal et al. [8], they compare the methods on 
different tasks. Finally, Hamilton et al. [10] introduce an encoder-decoder frame- 
work to describe representative embedding algorithms from a methodological 
perspective. In this framework, the encoder corresponds to the function which 
maps the elements of a graph as vectors. The decoder is a function which asso- 
ciates a specific graph statistic to the obtained vectors, for instance for a pair 
of node embeddings the decoder can give their similarity in the vector space, 
allowing the similarity of the nodes in the original graph to be quantified. 

From this last work, we retained the encoder-decoder framework and we pro- 
pose to use it for evaluating the different embedding methods. To that end, we 
compare, using metrics that we introduce, the value computed by the decoder 
with the value associated to the corresponding nodes in the graph for the equiv- 
alent function. Thus, in this paper, we adopt a different point of view from 
the previous task-oriented evaluations. Indeed, all of them consider embeddings 
as a black box, i.e., they use the obtained features without considering their properties. 
They ignore the fact that embedding algorithms are designed, explicitly 
or implicitly, to preserve some particular structural properties and that their usefulness 
for a given task depends on how well they capture them. Thus, in this 
paper, through an experimental comparative study, we compare the ability of 
embedding algorithms to capture specific properties, i.e., first-order proximity of 
nodes, structural equivalence (second-order proximity), isomorphic equivalence 
and community structure. 

In Sect. 2, these topological properties are formally defined and measures 
are introduced to evaluate to what extent embedding methods encode them. 
Section 3 presents the studied embedding methods. Section 4 describes the 
datasets used for the experiments, while Sect. 5 presents the results. 


2 Structural Properties and Metrics 


There is a wide range of graph properties that are of interest. We propose to 
study several of them which are at the basis of network analysis and are directly 
linked with usual learning and mining tasks on graphs [13]. First, we measure 
the ability of an embedding method to recover the set of neighbors of the nodes 
which is the first-order proximity (P1). This property is important for several 
downstream tasks: clustering where vectors of the same cluster represent nodes of 
the same community, graph reconstruction where two similar vectors represent 
two nodes that are neighbors in the graph, and node classification based for 
instance on majority vote of the neighbors. Secondly, we evaluate the ability of 
embedding methods to capture the second-order proximity (P2) which is the 
fact that two nodes have the same set of neighbors. This property is especially 
interesting when dealing with link prediction since, in social graphs, it is assumed 
that two nodes that share the same friends are likely to become friends too. 
Thirdly, we measure how much an embedding method is able to capture the roles 
of nodes in a graph which is the isomorphic equivalence (P3). This property is 
interesting when looking for specific nodes like leaders or outsiders. Finally, we 
evaluate the ability of an embedding method to detect communities (P4) in a 
graph, which has been an ongoing field of research for the last 20 years. Next, 
we define both properties and measures we use in order to quantify how much 
an embedding method is able to capture those properties. 

Let $G(V, E)$ be an unweighted and undirected graph where $V = \{v_0, \ldots, v_{n-1}\}$ 
is the set of $n$ vertices, $E = \{e_{ij}\}$ the set of $m$ edges and $A$ is its binary 
adjacency matrix. Graph embedding consists in encoding the graph into a low-dimensional 
space $\mathbb{R}^d$, where $d$ is the dimension of the real space, with a function 
$f: V \to Y$ which maps vertices to vector embeddings while preserving some 
properties of the graph. We note $Y \in \mathbb{R}^{n \times d}$ the embedding matrix and $Y_i$ its 
$i$-th row representing the node $v_i$. 

Neighborhood or first-order proximity (P1): capturing the neighborhood 
for an embedding method means that it aims at keeping any two nodes $v_i$ 
and $v_j$ that are linked in the original graph ($A_{ij} = 1$) close in the embedding 
space. The measure $S$ designed for this property is based on the comparison 
between the set $N(v_i)$ of neighbors in the graph of every node $v_i$ and the set 
$N_E(v_i)$ of its $|N(v_i)|$ nearest neighbors in the embedding space, where $|N(v_i)|$ 
is its degree. Finally, by averaging over all nodes, $S$ quantifies the ability of an 
embedding to respect the neighborhood. The higher $S$, the more P1 is preserved. 


$$S(v_i) = \frac{|N(v_i) \cap N_E(v_i)|}{|N(v_i)|}, \qquad S = \frac{1}{n} \sum_{i} S(v_i) \tag{1}$$
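
As an illustration, the neighborhood measure can be computed as in the following sketch. It is not the implementation from the study: the function name is hypothetical, the Euclidean distance is used in the embedding space (Table 1 lists a dot-product decoder for several methods), ties are broken by node index, and isolated nodes, for which the ratio is undefined, are skipped.

```python
import math


def neighborhood_score(adj, emb):
    """Average overlap between graph neighbors and embedding neighbors.

    adj is a binary adjacency matrix, emb a list of embedding vectors.
    """
    n = len(adj)
    total = 0.0
    counted = 0
    for i in range(n):
        graph_nbrs = {j for j in range(n) if j != i and adj[i][j]}
        k = len(graph_nbrs)
        if k == 0:
            continue  # assumption: nodes of degree 0 are left out of the average
        # The k nearest nodes in the embedding space, k being the degree of i.
        by_distance = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(emb[i], emb[j]),
        )
        emb_nbrs = set(by_distance[:k])
        total += len(graph_nbrs & emb_nbrs) / k
        counted += 1
    return total / counted if counted else 0.0


# Path graph 0 - 1 - 2 - 3 with two one-dimensional embeddings.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
ordered = [[0.0], [1.0], [2.0], [3.0]]   # respects the path order
shuffled = [[0.0], [3.0], [1.0], [2.0]]  # breaks the path order
print(neighborhood_score(path, ordered))
print(neighborhood_score(path, shuffled))
```

The embedding that lays the path out in order scores higher than the shuffled one, which is the behavior the measure is meant to reward.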


Structural equivalence or second-order proximity (P2): two vertices 
are structurally equivalent if they share many of the same neighbors [13]. To 
measure the efficiency of an embedding method to recover the structural equivalence, 
we define the distance $\mathrm{dist}_A(A_i, A_j)$ between the rows of the adjacency 
matrix corresponding to each pair of nodes $(v_i, v_j)$, and $\mathrm{dist}_E(Y_i, Y_j)$ the distance 
between their representative vectors in the embedding space. The metric 
for P2 is defined by the correlation coefficient (Spearman or Pearson) Struct_eq 
between those values for all pairs of nodes. The higher Struct_eq (close to 1), 
the better P2 is preserved by the algorithm. 


$$L_A(v_i, v_j) = \mathrm{dist}_A(A_i, A_j), \qquad L_E(v_i, v_j) = \mathrm{dist}_E(Y_i, Y_j) \tag{2}$$


with $\mathrm{dist}_A$ the distance in the adjacency matrix (cosine or Euclidean) and $\mathrm{dist}_E$ 
the embedding similarity which is indicated in Table 1. Finally, 


$$\mathrm{Struct\_eq} = \mathrm{pearson}(L_A, L_E) \tag{3}$$
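
Equations (2) and (3) can be sketched as follows. The sketch uses the Euclidean distance for both the adjacency rows and the embedding vectors, which is one of the options named above; the helper names are hypothetical and the Pearson coefficient is written out by hand to keep the example self-contained.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def struct_eq(adj, emb):
    """Correlation between adjacency-row distances and embedding distances."""
    pairs = list(combinations(range(len(adj)), 2))
    l_a = [math.dist(adj[i], adj[j]) for i, j in pairs]  # L_A in Eq. (2)
    l_e = [math.dist(emb[i], emb[j]) for i, j in pairs]  # L_E in Eq. (2)
    return pearson(l_a, l_e)                             # Eq. (3)


# Path graph 0 - 1 - 2 - 3; using the adjacency rows themselves as the
# embedding preserves every pairwise distance.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(struct_eq(path, path))
```

On large graphs the number of pairs grows quadratically, which is why the pairs are computed on a sample of nodes in the experiments.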

Isomorphic equivalence (P3): two nodes are isomorphically equivalent, 
i.e., they share the same role in the graph, if their ego-networks are isomorphic 
[4]. The ego-network of node $v_i$ is defined as the subgraph $EN_i$ made up of 
its neighbors and the edges between them (without $v_i$ itself). To go beyond a 
binary evaluation, for each pair of nodes $(v_i, v_j)$, we compute the Graph Edit 
Distance $GED(EN_i, EN_j)$ between their ego-networks $EN_i$ and $EN_j$ using 
the Graph Matching Toolkit [16], and the distance between their representative 
vectors in the embedding space $\mathrm{dist}_E(Y_i, Y_j)$; $\mathrm{dist}_E$ is indicated in Table 1. 
Finally, the Pearson and Spearman correlation coefficients between those values 
computed on all pairs of nodes are used to have an indicator for the whole graph. 
A negative correlation means that if the distance in the embedding space is large 
then exp(-GED), as in [15], is small. So, for ease of reading, we take the opposite 
of the correlation coefficient such that, for all measures, the best result is 1. 
Thus, the higher Isom_eq, the better P3 is preserved by the algorithm. 


$$L_{Egonet}(v_i, v_j) = \exp(-GED(EN_i, EN_j)), \qquad L_E(v_i, v_j) = \mathrm{dist}_E(Y_i, Y_j) \tag{4}$$


$$\mathrm{Isom\_eq} = -\mathrm{pearson}(L_{Egonet}, L_E) \tag{5}$$
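
The two ingredients of Eqs. (4) and (5), ego-network extraction and the sign-flipped correlation, can be sketched as below. Computing the graph edit distance itself is left to an external tool, so `ged` is assumed to be a precomputed mapping from node pairs to distances; the values in the example are made up purely for illustration, and all function names are hypothetical.

```python
import math
from itertools import combinations


def ego_network(adj, i):
    """Neighbors of node i and the edges among them (node i excluded)."""
    nodes = [j for j in range(len(adj)) if j != i and adj[i][j]]
    edges = [(u, v) for u, v in combinations(nodes, 2) if adj[u][v]]
    return nodes, edges


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def isom_eq(ged, emb):
    """Negated correlation between exp(-GED) and embedding distance.

    ged maps each node pair (i, j) to a precomputed graph edit distance
    between the ego-networks of i and j.
    """
    pairs = sorted(ged)
    l_ego = [math.exp(-ged[p]) for p in pairs]               # Eq. (4), left
    l_emb = [math.dist(emb[i], emb[j]) for i, j in pairs]    # Eq. (4), right
    return -pearson(l_ego, l_emb)                            # Eq. (5)


# Triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = [[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
print(ego_network(adj, 0))

# Made-up GED values and embedding for three nodes: nodes 0 and 1 play the
# same role (GED 0) and are embedded close together.
toy_ged = {(0, 1): 0.0, (0, 2): 2.0, (1, 2): 2.0}
toy_emb = [[0.0], [0.1], [1.0]]
print(isom_eq(toy_ged, toy_emb))
```

Taking exp(-GED) turns the distance into a similarity, so a good embedding produces a negative raw correlation, and the sign flip makes 1 the best score as for the other measures.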


Community/cluster membership (P4): communities can be defined as 
“groups of vertices having higher probability of being connected to each other 
than to members of other groups” [7]. On the other hand, clusters can be defined 
as sets of elements such that elements in the same cluster are more similar to 
each other than to those in other clusters. We propose to study the ability of 
an embedding method to transfer a community structure to a cluster structure. 
Given a graph with k ground-truth communities, we cluster, using KMeans (since 
k, the number of communities, is known), the node embeddings into k clusters. 
Finally, we compare this partition with the ground-truth partition using the 
adjusted mutual information (AMI). We also used the normalized mutual information 
(NMI) but both measures showed similar results. Let $L_{Community}$ be the 
ground-truth labeling and $L_{Clusters}$ the one found by KMeans. 


$$\mathrm{Score} = \mathrm{AMI}(L_{Community}, L_{Clusters}) \tag{6}$$
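
The comparison of two labelings can be illustrated with a small standard-library sketch. Note that it computes the plain NMI mentioned above rather than the AMI of Eq. (6), which additionally corrects for the agreement expected by chance; the KMeans step is not shown, and normalizing by the arithmetic mean of the two entropies is an assumption (other normalizations exist).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same nodes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Mutual information normalized by the mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings put every node in a single block
    return mutual_information(a, b) / (0.5 * (h_a + h_b))


ground_truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
print(nmi(ground_truth, same_up_to_renaming))
print(nmi(ground_truth, unrelated))
```

The score does not depend on how the clusters are numbered, which matters because KMeans assigns cluster indices arbitrarily.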


3 Embeddings 


There are many different graph embedding algorithms. We present a non- 
exhaustive list of recent methods, representative of the different families pro- 
posed in the state-of-the-art. We refer the reader to the full papers for more 
information. In Table 1 we mention all the embedding methods we used in our 
comparative study with the graph similarity they are supposed to preserve and 
the distance that is used in the embedding space to relate any pair of nodes of 
the graph. Two versions of N2V are used (A: p = 0.5,q = 4 for local random 
walks, B: p = 4, q = 0.5 for deeper random walks). 

Table 1. Studied methods with complexity, their graph similarity (encoder) and their 
distance in the embedding space (decoder) 


Name of the method | Complexity | Graph sim. | Embedding sim. 
Laplacian Eigenmaps (LE) [1] | O(N?) | 1st-order prox | Euclidean 
Locally Linear Emb. (LLE) [17] | O(N?) | 1st-order prox | Euclidean 
HOPE [14] | O(N?) | Katz-Index | Dot-product 
SVD of the adjacency matrix | O(N?) | 2nd-order prox | Dot-product 
struc2vec (S2V) [15] | O(Nlog(N)) | Co-occurrence proba | Dot-product 
node2vec (N2V) [9] | O(N) | Co-occurrence proba | Dot-product 
Verse [18] | O(N) | Perso. Page-Rank | Dot-product 
Kamada-Kawai layout (KKL) [11] | O(N?) | | Euclidean 
Multi-dim Scaling (MDS) [12] | | 1st-order prox | Euclidean 
SDNE [19] | O(N) | 1st & 2nd-order prox | Euclidean 


4 Graphs 


To evaluate embedding algorithms, we choose real graphs and generated graphs 
having different sizes and types: random (R), with preferential attachment (PA), 
social (S), social with community structure (SC) as shown in Table 2. While real 
graphs correspond to common datasets, generators allow us to control the characteristics 
of the graphs. Thus, we have prior knowledge which makes evaluation 
easier and more precise. Table 2 gives the characteristics of these graphs divided 
in three groups: small, medium and large graphs. 


Table 2. Dataset characteristics. All graphs are provided in our GitHub 


Name of the graph | Number of nodes | Number of edges | Type 
Zachary Karate Club (ZKC) | 34 | 77 | SC 
Erdos-Renyi (Gnp100) | 100 | 474 | R 
Barabasi-Albert (BA100) | 100 | 900 | PA 
Dancer (Dancer-100) | 100 | 243 | SC 
Email network (Email) | 1133 | 5452 | S 
Erdos-Renyi (Gnp1000) | 1000 | 4985 | R 
Barabasi-Albert (BA1000) | 1000 | 9900 | PA 
Dancer (Dancer-1k) | 1000 | 3627 | SC 
PGP | 10680 | 24316 | S 
Erdos-Renyi (Gnp10000) | 10000 | 49722 | R 
Barabasi-Albert (BA10k) | 10000 | 99900 | PA 
Dancer (Dancer-10k) | 10000 | 189886 | SC 


5 Results 


We used the metrics presented in Sect. 2 to quantify the ability of the embedding 
algorithms described in Sect. 3 to recover four properties of the graphs: 
first-order proximity (P1), structural and isomorphic equivalences (P2 and P3), 
and community membership (P4). Due to lack of space, we show only the most representative 
results and provide the others as additional materials¹. For the same 
reason, to evaluate P2 and P3, both Pearson and Spearman correlation coefficients 
have been computed but we only show results for Pearson as they are 
similar to those for Spearman. For readability, all scores are oriented so that 1 means 
an algorithm successfully captures a property and 0 means it does not. Moreover, 
a dash (-) in a table indicates that a method has not been able to provide 
a result. Note that due to high complexity, KKL and MDS are not computed for 
every graph. Finally, the code and datasets are available online on our GitHub 
(see footnote 1). 


5.1 Neighborhood (P1) 

Fig. 1. Neighborhood (P1) as a function of embedding dimension. 


1 https://github.com/vaudaine/Comparing embeddings. 

Table 3. Neighborhood (P1). Italic: best in row. Bold: best. 


(a) Email 

Dimensions | 2 | 10 | 100 | 1128 
LE | 0.086 | 0.196 | 0.371 | 0.007 
LLE | 0.193 | 0.352 | 0.589 | 0.021 
HOPE | 0.022 | 0.104 | 0.177 | 0.018 
S2V | 0.02 | 0.022 | 0.021 | 0.022 
N2VA | 0.044 | 0.245 | 0.37 | 0.437 
N2VB | 0.04 | 0.29 | 0.414 | 0.45 
SDNE | 0.024 | 0.047 | 0.055 | 0.041 
SVD | 0.054 | 0.138 | 0.134 | 0.026 
Verse | 0.019 | 0.021 | 0.021 | 0.021 
MDS | 0.104 | 0.287 | 0.793 | 0.919 

(b) Gnp10000 

Dimensions | 2 | 10 | 100 | 1000 
LE | 0.004 | 0.097 | 0.72 | 0.933 
LLE | - | - | 0.045 | 0.117 
HOPE | 0.002 | 0.01 | 0.226 | 0.094 
S2V | 0.001 | 0.001 | 0.001 | 0.001 
N2VA | 0.002 | 0.032 | 0.914 | 0.945 
N2VB | 0.002 | 0.045 | 0.935 | 0.935 
SDNE | 0.001 | 0.001 | 0.001 | 0.001 
SVD | 0.001 | 0.001 | 0.001 | 0.0 
Verse | 0.002 | 0.052 | 0.961 | 0.854 

For the first-order proximity (P1), we measure the similarity S as a function of 
the dimension d for all the embedding methods. For computational reasons, for 
large graphs, the measure is computed on 10% of the nodes. Results are shown in 
Fig. 1 and Table 3, for d varying from 2 to approximately the number of nodes. 
We can make several observations: for networks with communities (Dancer and 
ZKC), only LE and LLE reasonably capture this property. For the Barabasi-Albert 
and Erdos-Renyi networks, Verse, MDS and LE reach scores higher than 
LLE. This means that those algorithms are able to capture this property, but are 
fooled by complex mesoscopic organizations. These results can be generalized, 
as shown in the additional materials. MDS can show good performance, for instance 
on the Email dataset, Verse works only on our random graphs, LLE works only for 
ZKC and Dancer, while LE seems to show good performance on every graph 
when the right dimension is chosen. In the cases of LE and LLE, there is an 
optimal dimension: the increase of the similarity as the dimension grows can be 
explained by the fact that enough information is learned; the decrease is due to 
eigenvalue computation in high dimension, which is very noisy. To conclude, LE 
seems to be the best option to recover neighborhood but the right dimension 
has to be found. 


5.2 Structural Equivalence (P2) 


Concerning the second-order proximity (P2), we compute the Pearson correlation 
coefficient, as indicated in Sect. 2, as a function of the embedding space dimension 
d and we use the same sampling strategy as for property P1. 

The results are shown in Fig. 2 and Table 4. Two methods are expected to 
have good results, because they explicitly embed the structural equivalence: SVD 
and SDNE. HOPE does not explicitly embed this property but a very similar one, 
the Katz-Index. On every small graph, SVD indeed performs the best 
and with the lowest dimension. HOPE still has very good results. The Pearson 
coefficient grows as the dimension of the embedding grows, which implies that 
the best results are obtained when the dimension of the space is high enough. 
The other algorithms fail to recover the structural equivalence. For medium 
and large graphs as presented in Table 4, SVD and HOPE still show very good 
performance and the higher the dimension of the embedding space, the higher 
the correlation. For large graphs, SDNE also shows very good results but it 
seems to need more data to be able to learn properly. In the end, SVD seems 
to be the best algorithm to capture the second-order proximity. It computes a 
singular value decomposition which is fast and scalable, but SDNE also performs 
very well on the largest graphs and, in that case, it can outperform SVD. 

Fig. 2. Structural equivalence (P2) as a function of embedding dimension. 


5.3 Isomorphic Equivalence (P3) 


With the property P3, we investigate the ability of an embedding algorithm to 
capture roles in a graph. To do so, we compute the graph edit distance (GED) 
between every pair of nodes in the graph and the distance between the vec- 
tors of the embedding. Moreover, we sample nodes at random and compute the 
GED only between every pair of the sampled nodes thus reducing the computing 
time drastically. We sample 10% of the nodes for medium graphs and 1% of the 
nodes for large graphs. Experiments have demonstrated that results are robust to 
sampling. We present, in Fig. 3 and Table 5, the evolution of the correlation coef- 
ficient according to the dimension of the embedding space. The only algorithm 
that is supposed to perform well for this property is Struc2vec. Note also that 
algorithms which capture the structural equivalence can also give results since 
two nodes that are structurally equivalent are also isomorphically equivalent 
but the converse is not true. For small graphs, as illustrated in Fig. 3, Struc2vec 
(S2V) is nearly always the best. It performs well on medium and large graphs 
too, as shown in Table 5. However, results obtained on other graphs (available in 
supplementary material) indicate that Struc2vec is not always much better than 
the other algorithms. As a matter of fact, Struc2vec remains the best algorithm 
for this measure but it is not totally accurate since the correlation coefficient is 
not close to 1 on every graph, e.g., on Dancer_10k in Table 5(b). 

Table 4. Structural equivalence (P2). Italic: best in row. Bold: best. 


(a) Dancer_1k 

Dimensions | 2 | 10 | 100 | 995 
LE | 0.593 | 0.281 | 0.052 | 0.044 
LLE | 0.079 | -0.069 | -0.244 | -0.441 
HOPE | 0.726 | 0.909 | 0.967 | 0.947 
S2V | 0.041 | 0.134 | 0.187 | 0.131 
N2VA | 0.043 | -0.038 | -0.018 | -0.033 
N2VB | 0.05 | -0.055 | -0.042 | -0.036 
SDNE | 0.174 | 0.037 | 0.034 | 0.626 
SVD | 0.823 | 0.933 | 0.987 | 1.0 
Verse | 0.036 | -0.038 | 0.023 | 0.141 
MDS | -0.053 | -0.015 | -0.048 | -0.079 

(b) BA10k 

Dimensions | 2 | 10 | 100 | 1000 
LE | 0.06 | 0.077 | 0.189 | 0.192 
LLE | - | - | -0.724 | -0.785 
HOPE | 0.844 | 0.723 | 0.799 | 0.967 
S2V | 0.003 | 0.457 | 0.744 | 0.717 
N2VA | 0.438 | 0.144 | -0.289 | 0.297 
N2VB | 0.445 | -0.175 | -0.342 | 0.402 
SDNE | 0.678 | 0.787 | 0.952 | 0.954 
SVD | 0.795 | 0.621 | 0.873 | 0.983 
Verse | -0.036 | -0.386 | -0.186 | 0.642 


[Figure 3: four panels, (a) BA100, (b) Dancer_100, (c) Gnp100, (d) ZKC.
x-axis: Dimensions; y-axis: Isomorphic equivalence. Legend (Method): LE, LLE,
n2vA, n2vB, HOPE, verse, sdne, svd, mds, KKL, s2v.]

Fig. 3. Isomorphic equivalence (P3) as a function of embedding dimension.
Table 5. Isomorphic equivalence (P3). Italic: Best in row. Bold: best.

(a) Gnp1000

Dimensions |      2 |     10 |    100 |    995
LE         |  0.058 |  0.053 |  0.023 |  0.023
LLE        |  0.004 |  0.055 |  0.05  |  0.111
HOPE       |  0.687 |  0.295 |  0.299 |  0.126
S2V        |  0.468 |  0.761 |  0.759 |  0.753
N2VA       |  0.18  |  0.08  |  0.119 |  0.107
N2VB       |  0.327 |  0.041 |  0.053 |  0.03
SDNE       |  nan   |  0.088 | -0.057 |  0.004
SVD        |  0.39  |  0.295 |  0.284 |  0.165
Verse      |  0.077 | -0.017 |  0.006 |  0.101
MDS        |  0.018 | -0.011 |  0.001 |  0.01

(b) Dancer_10k

Dimensions |      2 |     10 |    100 |   1000
LE         | -0.068 |  0.072 |  0.05  | -0.052
LLE        | -0.088 |  0.009 | -0.008 | -0.102
HOPE       |  0.086 |  0.075 |  0.108 |  0.103
S2V        |  0.11  |  0.258 |  0.431 |  0.401
N2VA       |  0.123 |  0.166 |  0.38  |  0.203
N2VB       |  0.123 |  0.161 |  0.204 |  0.081
SDNE       |  0.057 |  0.083 |  0.035 |  0.086
SVD        |  0.053 |  0.076 |  0.1   |  0.102
Verse      |  0.036 | -0.032 | -0.071 | -0.148


5.4 Community Membership (P4) 


To study the ability of an embedding to recover the community structure of a
graph (P4), we compare the partition given by KMeans on the node embeddings
with the ground-truth partition, using Adjusted Mutual Information (AMI) and
Normalized Mutual Information (NMI). The results are given only for PPG
(averaged over 3 instances) and Dancer graphs (for 20 different graphs), for
which the community structure (ground truth) is provided by the generators. To
obtain them, we generated planted partition graphs (PPG) with 10 communities
and 100 nodes per community. We set the probability of an edge existing between
communities to p_out = 0.01 and vary the probability of an edge existing within
a community, p_in, from 0.01 (no communities) to 1 (clearly defined
communities), thus varying the modularity of the graph from 0 to 0.7. For
Dancer, we generate 20 graphs with varying community structure by adding
between-community edges and removing within-community edges. Moreover, we also
apply standard community detection algorithms, namely Louvain's modularity
maximisation (maxmod) [2] and Infomap [3], to the graphs. Results are shown in
Fig. 4. In low dimension (d = 2, left of the figure), every embedding is less
effective than the standard community detection algorithms. In higher dimension
(d = 128, right of the figure), many embedding techniques (Verse, MDS, N2V in
both versions, and HOPE on PPG) match the results of the best community
detection algorithm, Louvain. As expected, AMI increases with the modularity
for all methods.

Fig. 4. AMI for community detection on PPG (top) and Dancer (bottom)
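To make the comparison of two partitions concrete, the following is a minimal sketch of NMI in plain Python. It assumes the arithmetic-mean normalisation, which is only one of several common conventions, and it omits the chance correction that distinguishes AMI from NMI. The function names are illustrative and not taken from the paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same nodes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalisation (an assumed convention)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both partitions are trivial and therefore identical
    return mutual_information(a, b) / ((h_a + h_b) / 2)
```

NMI is invariant to a renaming of the cluster labels, which is why it can compare a KMeans partition with a ground-truth partition without first matching label names.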


6 Conclusion 


In this paper, we studied how a wide range of graph embedding techniques
preserve essential structural properties of graphs. Most recent work on graph
embeddings has focused on the introduction of new methods and on task-oriented
evaluation, but it ignores the rationale of the methods and focuses only on
their performance on a specific task in a particular setting. As a consequence,
methods that have been designed to embed local structures are compared with
methods that should embed global structures on tasks as diverse as link
prediction or community detection. In contrast, we focused on (i) the
structural properties for which each algorithm has been designed, and (ii) how
well these properties are effectively preserved in practice, on networks having
diverse topological properties. As a result, we have shown that no method
embeds all properties efficiently, and that most methods embed effectively only
one of them. We have also shown that most recently introduced methods are
outperformed, or at least challenged, by older methods specifically designed
for that purpose, such as LE/LLE for P1, SVD for P2, and modularity
optimization for P4. Finally, we have shown that, even when they have been
designed to embed a particular property, most methods fail to do so in every
setting. In particular, some algorithms (particularly LE and LLE) have shown a
substantial, non-monotonic sensitivity to the number of dimensions, which can
be difficult to choose in an unsupervised context.

In order to improve graph embedding methods, we believe that we need to
better understand the nature of the produced embeddings. We wish to pursue this
work in two directions: (1) understanding how those methods can obtain good
results on tasks depending mainly on local structures, such as link prediction,
when they do not encode local properties efficiently, and (2) studying how well
the meso-scale structure is preserved by such algorithms.
Abstract. Learning performance can show non-monotonic behavior.
That is, more data does not necessarily lead to better models, even on
average. We propose three algorithms that take a supervised learning
model and make it behave more monotonically. We prove consistency and
monotonicity with high probability, and evaluate the algorithms on
scenarios where non-monotone behaviour occurs. Our proposed algorithm
MT_HT makes fewer than 1% non-monotone decisions on MNIST while
staying competitive in terms of error rate compared to several baselines.
Our code is available at https://github.com/tomviering/monotone.


Keywords: Learning curve - Model selection - Learning theory 


1 Introduction 


It is a widely held belief that more training data usually results in better gener- 
alizing machine learning models—cf. [11,17] for instance. Several learning prob- 
lems have illustrated, however, that more training data can lead to worse gen- 
eralization performance [3,9,12]. For the peaking phenomenon [3], this occurs 
exactly at the transition from the underparametrized to the overparametrized 
regime. This double-descent behavior has regained interest in the context
of deep neural networks [1,18], since these models are typically overparametrized.
Recently, several new examples have also been found in which, in quite simple
settings, more data results in worse generalization performance [10,19].

It can be difficult to explain to a user that machine learning models can
actually perform worse when more, possibly expensive-to-collect, data has been
used for training. Besides, it seems generally desirable to have algorithms that
guarantee increased performance with more data. How can we obtain such a
guarantee? That is the question we investigate in this work, and for which we
use learning curves. Such curves plot the expected performance of a learning
algorithm versus the amount of training data.¹ In other words, we ask how we
can make learning curves monotonic.

The core approach to make learners monotone is that, when more data is
gathered and a new model is trained, this newly trained model is compared to
the currently adopted model that was trained on less data. Only if the new model
performs better should it be used. We introduce several wrapper algorithms for
supervised classification techniques that use the holdout set or cross-validation
to make this comparison. Our proposed algorithm MT_HT uses a hypothesis test
and switches only if the new model improves significantly upon the old model.
Using guarantees from the hypothesis test, we can prove that the resulting
learning curve is monotone with high probability. We empirically study the
effect of the parameters of the algorithms and benchmark them on several
datasets, including MNIST [8], to check to what degree the learning curves
become monotone.

¹ Not to be confused with training curves, where the loss versus epochs
(optimization iterations) is plotted.

This work is organized as follows. The notion of monotonicity of learning 
curves is reviewed in Sect. 2. We introduce our approaches and algorithms in 
Sect. 3, and prove consistency and monotonicity with high probability in Sect. 4. 
Section 5 provides the empirical evaluation. We discuss the main findings of our 
results in Sect. 6 and end with the most important conclusions. 


2 The Setting and the Definition of Monotonicity 


We consider the setting where we have a learner that now and then receives data
and that is evaluated over time. The question is then how to make sure that the
performance of this learner over time is monotone, or in other words, how
can we guarantee that this learner improves its performance over time?

We analyze this question in a (frequentist) classification framework. We
assume there exists an (unknown) distribution $P$ over $\mathcal{X} \times \mathcal{Y}$,
where $\mathcal{X}$ is the input space (features) and $\mathcal{Y}$ is the output
space (classification labels). To simplify the setup we operate in rounds
indicated by $i$, where $i \in \{1, \dots, n\}$. In each round, we receive a
batch of samples $S^i$ that is sampled i.i.d. from $P$. The learner $L$ can use
this data in combination with data from previous rounds to come up with a
hypothesis $h_i$ in round $i$. The hypothesis comes from a hypothesis space
$\mathcal{H}$. We consider learners $L$ that, as subroutine, use a supervised
learner $A : \mathcal{S} \to \mathcal{H}$, where $\mathcal{S}$ is the space of all
possible training sets.

We measure performance by the error rate. The true error rate on $P$ equals

$$\epsilon(h) = \int_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} l_{0-1}(h(x), y) \, dP(x, y) \qquad (1)$$

where $l_{0-1}$ is the zero-one loss. We indicate the empirical error rate of
$h$ on a sample $S$ as $\hat{\epsilon}(h, S)$. We call $n$ rounds a run. The true
error of the hypothesis $h_i$ returned by the learner $L$ in round $i$ is
indicated by $\epsilon_i$; all the $\epsilon_i$'s of a run form a learning curve.
By averaging multiple runs one obtains the expected learning curve,
$\bar{\epsilon}_i$.

The goal for the learner $L$ is twofold. The error rates $\epsilon_i$ of the
returned models should (1) be as small as possible, and (2) be monotonically
decreasing. These goals can be at odds with one another. For example, always
returning a fixed model ensures monotonicity but incurs large error rates. To
measure (1), we summarize the performance of a learning curve using the Area
Under the Learning Curve (AULC) [6,13,16]. The AULC averages all $\epsilon_i$'s
of a run. A low AULC indicates that a learner manages to quickly reduce the
error rate.
Monotone in round $i$ means that $\epsilon_{i+1} \le \epsilon_i$. We may care
about monotonicity of the expected learning curve or of individual learning
curves. In practice, however, we typically get one chance to gather data and
submit models. In that case, we want to make sure that any additional data also
leads to better performance. Therefore, we are mainly concerned with
monotonicity of individual learning curves. We quantify monotonicity of a run
by the fraction of non-monotone transitions in an individual curve.
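The two summary measures of this section can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, assuming that a transition counts as non-monotone when the error rate strictly increases and that the fraction is taken over the n - 1 transitions of a run. The function names are hypothetical.

```python
def aulc(errors):
    """Area under the learning curve: the mean of the per-round error rates."""
    return sum(errors) / len(errors)


def nonmonotone_fraction(errors):
    """Fraction of transitions i -> i+1 in which the error rate increases."""
    transitions = list(zip(errors, errors[1:]))
    if not transitions:
        return 0.0  # a single round has no transitions
    return sum(nxt > cur for cur, nxt in transitions) / len(transitions)
```

For the curve `[0.5, 0.4, 0.45, 0.3]`, only the second transition increases the error, so one of the three transitions is non-monotone.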


3 Approaches and Algorithms 


We introduce three algorithms (learners $L$) that wrap around supervised learners
with the aim of making them monotone. First, we provide some intuition for how
to achieve this: ideally, during the generation of the learning curve, we would
check whether $\epsilon(h_{i+1}) \le \epsilon(h_i)$. A fix to make a learner
monotone would be to output $h_i$ instead of $h_{i+1}$ if the error rate of
$h_{i+1}$ is larger. Since learners do not have access to $\epsilon(h_i)$, we
have to estimate it using the incoming data. The first two algorithms,
MT_SIMPLE and MT_HT, use the holdout method to this end; newly arriving data is
partitioned into training and validation sets. The third algorithm, MT_CV,
makes use of cross validation.


MT_SIMPLE: Monotone Simple. The pseudo-code for MT_SIMPLE is given by
Algorithm 1 in combination with the function UpdateSimple. Batches $S^i$ are
split into training ($S^i_t$) and validation ($S^i_v$) parts. The training set
$S_t$ is enlarged each round with $S^i_t$ and a new model $h_i$ is trained.
$S^i_v$ is used to estimate the performance of $h_i$ and $h_{best}$. We store
the previously best performing model, $h_{best}$, and compare its performance to
that of $h_i$. If the new model $h_i$ is better, it is returned and $h_{best}$ is
updated; otherwise $h_{best}$ is returned.

Because $h_i$ and $h_{best}$ are both evaluated on $S^i_v$, the comparison is
paired and therefore more accurate. After the comparison, $S^i_v$ can safely be
added to the training set (line 7 of Algorithm 1).

We call this algorithm MT_SIMPLE because the model selection is a bit naive:
for small validation sets, the variance in the performance measure could be quite
large, leading to many non-monotone decisions. In the limit of infinitely large
$S^i_v$, however, this algorithm should always be monotone (and very data hungry).


MT_HT: Monotone Hypothesis Test. The second algorithm, MT_HT, aims
to resolve the issues of MT_SIMPLE with small validation set sizes. In addition,
for this algorithm, we prove that individual learning curves are monotone with
high probability. The same pseudo-code is used as for MT_SIMPLE (Algorithm 1),
but with a different update function, UpdateHT. Now a hypothesis test HT
determines if the newly trained model is significantly better than the previous
model. The hypothesis test makes sure that the newly trained model does not
merely appear better due to chance (such as an unlucky sample). The hypothesis
test is conservative, and only switches to a new model if we are reasonably
sure it is significantly better, to avoid non-monotone decisions. Japkowicz and
Shah [7] provide an accessible introduction to frequentist hypothesis testing.
Algorithm 1. MT_SIMPLE and MT_HT

input: supervised learner A, rounds n, batches S^i,
       u ∈ {updateSimple, updateHT},
       if u = updateHT: confidence level α, hypothesis test HT
 1  S_t = {}
 2  for i = 1, ..., n do
 3      Split S^i in S^i_t and S^i_v
 4      Append to S_t: S_t = [S_t; S^i_t]
 5      h_i ← A(S_t)
 6      Update_i ← u(S^i_v, h_i, h_best, α, HT)   // see below
 7      Append to S_t: S_t = [S_t; S^i_v]
 8      if Update_i or i = 1 then
 9          h_best ← h_i
10      end
11      Return h_best in round i
12  end

Function UpdateSimple
input: S^i_v, h_i, h_best
 1  P_current ← ε̂(h_i, S^i_v)
 2  P_best ← ε̂(h_best, S^i_v)
 3  return (P_current < P_best)

Function UpdateHT
input: S^i_v, h_i, h_best, confidence level α, hypothesis test HT
 1  p = HT(S^i_v, h_i, h_best)   // p-value
 2  return (p < α)
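The wrapper of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: samples are `(x, y)` pairs, the first `n_val` samples of each batch form the validation part, and `majority_learner` is a hypothetical toy stand-in for the supervised learner A. Only the UpdateSimple rule is shown; a hypothesis-test rule with the same signature could be passed as `update`.

```python
from collections import Counter


def error_rate(model, samples):
    """Empirical zero-one error of `model` on a list of (x, y) pairs."""
    return sum(model(x) != y for x, y in samples) / len(samples)


def update_simple(val, h_new, h_best):
    """UpdateSimple: switch only if the new model has lower validation error."""
    return error_rate(h_new, val) < error_rate(h_best, val)


def monotone_wrapper(train, batches, n_val, update=update_simple):
    """Return the model adopted in each round, following Algorithm 1."""
    train_set, h_best, returned = [], None, []
    for i, batch in enumerate(batches):
        val, new_train = batch[:n_val], batch[n_val:]  # split S^i (assumed order)
        train_set = train_set + new_train              # append S^i_t
        h_new = train(train_set)                       # h_i = A(S_t)
        switch = i == 0 or update(val, h_new, h_best)  # paired comparison on S^i_v
        train_set = train_set + val                    # append S^i_v afterwards
        if switch:
            h_best = h_new
        returned.append(h_best)
    return returned


def majority_learner(samples):
    """Toy learner (hypothetical): always predict the majority training label."""
    label = Counter(y for _, y in samples).most_common(1)[0][0]
    return lambda x: label
```

Calling `monotone_wrapper(majority_learner, batches, n_val=2)` on a list of batches returns one adopted model per round. The validation data is added to the training set only after the comparison, so the comparison itself is made on samples that neither model has seen.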


The choice of hypothesis test depends on the performance measure. For the
error rate the McNemar test can be used [7,14]. The hypothesis test should use
paired data, since we evaluate two models on one sample, and it should be
one-tailed, since we only want to know whether $h_i$ is better than $h_{best}$ (a
two-tailed test would switch to $h_i$ if its performance is significantly
different). The test compares two hypotheses: $H_0 : \epsilon(h_i) = \epsilon(h_{best})$
and $H_1 : \epsilon(h_i) < \epsilon(h_{best})$.

Several versions of the McNemar test can be used [4,7,14]. We use the McNemar
exact conditional test, which we briefly review. Let $b$ be the random variable
indicating the number of samples of $S^i_v$ classified correctly by $h_{best}$
and incorrectly by $h_i$, and let $N_d$ be the number of samples where they
disagree. The test conditions on $N_d$. Assuming $H_0$ is true,
$P(b = x \mid H_0, N_d) = \binom{N_d}{x} \left(\frac{1}{2}\right)^{N_d}$.
Given $x$ $b$'s, the p-value for our one-tailed test is
$p = \sum_{i=0}^{x} P(b = i \mid H_0, N_d)$.

The one-tailed p-value is the probability of observing a sample at least as
extreme given hypothesis $H_0$, considering the tail direction of $H_1$. The
smaller the p-value, the more evidence we have for $H_1$. If the p-value is
smaller than $\alpha$, we accept $H_1$, and thus we update the model $h_{best}$.
The smaller $\alpha$, the more conservative the hypothesis test, and thus the
smaller the chance that a wrong decision is made due to unlucky sampling. For
the McNemar exact conditional test [4] the False Positive Rate (FPR, or the
probability of making a Type I error) is bounded by $\alpha$:
$P(p < \alpha \mid H_0) \le \alpha$. We need this to prove monotonicity with high
probability.
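The one-tailed p-value of the exact conditional test is a lower binomial tail and can be computed directly. The sketch below is illustrative: the function names are hypothetical, and `disagreement_counts` assumes that the disagreements are the discordant samples, i.e. those that exactly one of the two models classifies correctly.

```python
from math import comb


def mcnemar_exact_p(b, n_d):
    """One-tailed exact conditional McNemar p-value.

    b   : samples that h_best classifies correctly and h_i incorrectly
    n_d : samples on which the two models disagree
    """
    return sum(comb(n_d, i) for i in range(b + 1)) / 2 ** n_d


def disagreement_counts(pred_new, pred_best, labels):
    """Return (b, n_d), counting only samples where exactly one model is right."""
    b = sum(pn != y and pb == y for pn, pb, y in zip(pred_new, pred_best, labels))
    c = sum(pn == y and pb != y for pn, pb, y in zip(pred_new, pred_best, labels))
    return b, b + c
```

A small `b` relative to `n_d` means that the new model wins most of the disagreements, which yields a small p-value. With no disagreements at all the p-value is 1, so the test never switches models in that case.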
MT_CV: Monotone Cross Validation. In practice, K-fold cross validation
(CV) is often used to estimate model performance instead of the holdout. This
is what MT_CV does, and it is otherwise similar to MT_SIMPLE. As described in
Algorithm 2, for each incoming sample an index $I$ maintains to which fold it
belongs. These indices are used to generate the folds for the K-fold cross
validation.

During CV, $K$ models are trained and evaluated on the validation sets. We
now have to memorize $K$ previously best models, one for each fold. We average
the performance of the newly trained models over the $K$ folds, and compare
that to the average of the best previous $K$ models. This averaging over folds is
essential, as it reduces the variance of the model selection step compared
to selecting the best model overall (as MT_SIMPLE does).

In our framework we return a single model in each iteration. We return the 
model with the optimal training set size that performed best during CV. This 
can further improve performance. 


Algorithm 2. MT_CV

input: K folds, learner A, rounds n, batches S^i
 1  b ← 1   // keeps track of best round
 2  S = {}, I = {}
 3  for i = 1, ..., n do
 4      Generate stratified CV indices for S^i and put in I^i. Each index
        indicates to which validation fold the corresponding sample belongs.
 5      Append to S: S ← [S; S^i]
 6      Append to I: I ← [I; I^i]
 7      for k = 1, ..., K do
 8          h^k_i ← A(S[I ≠ k])           // training set of kth fold
 9          P^k_i ← ε̂(h^k_i, S[I = k])    // validation set of kth fold
10          P^k_b ← ε̂(h^k_b, S[I = k])    // update performance of prev. models
11      end
12      Update_i ← (mean(P^k_i) < mean(P^k_b))   // mean w.r.t. k
13      if Update_i or i = 1 then
14          b ← i
15      end
16      k ← arg min_k P^k_b   // break ties
17      Return h^k_b in round i
18  end
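The fold bookkeeping of Algorithm 2 (lines 4 to 9) can be illustrated with a small sketch. It assumes one simple way to obtain stratified indices, namely dealing the samples of each class to the folds in round-robin order and carrying the per-class position over to the next batch. The paper does not specify this scheme, and the function names are hypothetical.

```python
from collections import defaultdict


def assign_folds(labels, n_folds, next_fold=None):
    """Assign a fold index in 1..n_folds to each new sample, round-robin per class."""
    next_fold = defaultdict(int) if next_fold is None else next_fold
    folds = []
    for y in labels:
        folds.append(next_fold[y] % n_folds + 1)
        next_fold[y] += 1
    return folds, next_fold


def split_fold(samples, folds, k):
    """Return (training set, validation set) of fold k: S[I != k] and S[I == k]."""
    train = [s for s, f in zip(samples, folds) if f != k]
    val = [s for s, f in zip(samples, folds) if f == k]
    return train, val
```

Because a sample keeps its fold index once assigned, every validation fold only grows over the rounds, and a model stored for fold `k` in an earlier round has never been trained on any sample of that fold.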


4 Theoretical Analysis 


We derive the probability of a monotone learning curve for MT_SIMPLE and
MT_HT, and we prove our algorithms are consistent if the model updates often
enough.


Theorem 1. Assume we use the McNemar exact conditional test (see Sect. 3)
with $\alpha \in (0, \frac{1}{2}]$. Then the individual learning curve generated
by Algorithm MT_HT with $n$ rounds is monotone with probability at least
$(1 - \alpha)^n$.
Proof. First we argue that the probability of making a non-monotone decision
in round $i$ is at most $\alpha$. If $H_1 : \epsilon(h_i) < \epsilon(h_{best})$ or
$H_0 : \epsilon(h_i) = \epsilon(h_{best})$ is true, we are monotone in round $i$,
so we only need to consider a new alternative hypothesis
$H_2 : \epsilon(h_i) > \epsilon(h_{best})$. Under $H_0$ we have [4]:
$P(p < \alpha \mid H_0) \le \alpha$. Conditioned on $H_2$, $b$ is binomial with a
larger mean than in the case of $H_0$; thus we observe larger p-values if
$\alpha \in (0, \frac{1}{2}]$, and hence
$P(p < \alpha \mid H_2) \le P(p < \alpha \mid H_0) \le \alpha$.
Therefore the probability of being non-monotone in round $i$ is at most
$\alpha$. This holds for any models $h_i$, $h_{best}$ and anything that happened
before round $i$. Since the $S^i_v$ are independent samples, being non-monotone
in each round can be seen as independent events, resulting in $(1 - \alpha)^n$.


If the probability of being monotone in all rounds should be at least $\beta$,
we can set $\alpha = 1 - \beta^{1/n}$ to fulfill this condition. Note that this
analysis also holds for MT_SIMPLE, since running MT_HT with
$\alpha = \frac{1}{2}$ results in the same algorithm as MT_SIMPLE for the
McNemar exact conditional test.
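The relation between the significance level, the number of rounds and the bound of Theorem 1 can be checked numerically. The sketch below solves $(1 - \alpha)^n = \text{target}$ for $\alpha$. The function and parameter names are illustrative.

```python
def monotone_probability_bound(alpha, n_rounds):
    """Lower bound (1 - alpha)^n on the probability of a monotone run."""
    return (1 - alpha) ** n_rounds


def alpha_for_target(target, n_rounds):
    """Largest alpha for which the bound (1 - alpha)^n still reaches `target`."""
    return 1 - target ** (1 / n_rounds)
```

With $\alpha = 0.05$ and $n = 150$ rounds the bound evaluates to less than 0.001, so for long runs the guarantee is loose unless $\alpha$ is chosen much smaller.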

We now argue that all proposed algorithms are consistent under some
conditions. First, let us revisit the definition of consistency [17].


Definition 1 (Consistency [17]). Let $L$ be a learner that returns a hypothesis
$L(S) \in \mathcal{H}$ when evaluated on $S$. For all $\epsilon_{excess} \in (0, 1)$,
for all distributions $D$ over $\mathcal{X} \times \mathcal{Y}$, for all
$\delta \in (0, 1)$, if there exists an $n(\epsilon_{excess}, D, \delta)$ such that
for all $m \ge n(\epsilon_{excess}, D, \delta)$, if $L$ uses a sample $S$ of size
$m$, the following holds with probability (over the choice of $S$) at least
$1 - \delta$,

$$\epsilon(L(S)) \le \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{excess}, \qquad (2)$$

then $L$ is said to be consistent.


Before we can state the main result, we have to introduce a bit of notation.
$U_i$ indicates the event that the algorithm updates $h_{best}$ (or, in the case
of MT_CV, the variable $b$). $H_i^{i+z}$ indicates the event
$\neg U_i \cap \neg U_{i+1} \cap \dots \cap \neg U_{i+z}$, or in words, that in
rounds $i$ to $i + z$ there has been no update. To fulfill consistency, we need
that when the number of rounds grows to infinity, the probability of updating
is large enough. Then consistency of $A$ makes sure that $h_{best}$ has
sufficiently low error. For this analysis it is assumed that the number of
rounds of the algorithms is not fixed.


Theorem 2. MT_SIMPLE, MT_HT and MT_CV are consistent if $A$ is consistent
and if for all $i$ there exist a $z_i \in \mathbb{N} \setminus 0$ and $C_i > 0$
such that for all $k \in \mathbb{N} \setminus 0$ it holds that
$P(H_i^{i+kz_i}) \le (1 - C_i)^k$.


Proof. Let $A$ be consistent with $n_A(\epsilon_{excess}, D, \delta)$ samples. Let
us analyze round $i$ where $i$ is big enough such that²
$|S_t| > n_A(\epsilon_{excess}, D, \frac{\delta}{2})$. Assume that

$$\epsilon(h_{best}) > \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{excess}, \qquad (3)$$

otherwise the proof is trivial. For any round $j \ge i$, since $A$ produces
hypothesis $h_j$ with $|S_t| > n_A(\epsilon_{excess}, D, \frac{\delta}{2})$ samples,

$$\epsilon(h_j) \le \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{excess} \qquad (4)$$

holds with probability of at least $1 - \frac{\delta}{2}$. Now $L$ should update.
The probability that in the next $k z_i$ rounds we don't update is, by
assumption, bounded by $(1 - C_i)^k$. Since $C_i > 0$, we can choose $k$ big
enough so that $(1 - C_i)^k \le \frac{\delta}{2}$. Thus the probability of not
updating after $k z_i$ more rounds is at most $\frac{\delta}{2}$, and we have a
probability of $\frac{\delta}{2}$ that the model after updating is not good
enough. Applying the union bound we find the probability of failure is at most
$\delta$.

² In case of MT_CV, take $|S_t|$ to be the smallest training fold size in round $i$.


A few remarks about the assumption. It tells us that an update becomes more and
more likely the more consecutive rounds there have been without an update.
It holds if every $z_i$ rounds the update probability is nonzero. A weaker but
also sufficient assumption is $\forall i : \lim_{z \to \infty} P(H_i^{i+z}) \to 0$.

For MT_SIMPLE and MT_CV the assumption is always satisfied, because these
algorithms look directly at the mean error rate, and due to fluctuations in the
sampling there is always a non-zero probability that
$\epsilon(h_i) < \epsilon(h_{best})$. However, for MT_HT this may not always be
satisfied. Especially if the validation batch sizes $N_v$ are small, the
hypothesis test may not be able to detect small differences in error; the test
then has zero power. If $N_v$ stays small, the power may stay zero even in
future rounds, in which case the learner is not consistent.
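The zero-power case can be made concrete for the exact conditional McNemar test. Its smallest attainable one-tailed p-value with $N_v$ validation samples is $(1/2)^{N_v}$, reached when the two models disagree on every sample and the new model wins each disagreement. The sketch below assumes the strict decision rule $p < \alpha$ and computes the smallest $N_v$ for which a switch is possible at all. The function names are illustrative.

```python
def smallest_p_value(n_val):
    """Smallest one-tailed exact McNemar p-value attainable with n_val samples."""
    return 0.5 ** n_val


def min_validation_size(alpha):
    """Smallest n_val for which p < alpha can occur (necessary, not sufficient)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n_val = 1
    while smallest_p_value(n_val) >= alpha:
        n_val += 1
    return n_val
```

For $\alpha = 0.05$ this gives 5: with four validation samples per round the smallest p-value is 0.0625, so under this rule the test could never trigger an update, whatever the models predict.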


5 Experiments 


We evaluate MT_SIMPLE and MT_HT on artificial datasets to understand the
influence of their parameters. Afterward we perform a benchmark where we also
include MT_CV and a baseline that uses validation data to tune the
regularization strength. This last experiment is also performed on the MNIST
dataset to get an impression of the practicality of the proposed algorithms.
First we describe the experimental setup in more detail.


Experimental Setup. The peaking dataset [3] and dipping dataset [9] are
artificial datasets that cause non-monotone behaviour. We use stratified
sampling to obtain batches $S^i$ for the peaking and dipping datasets; for
MNIST we use random sampling. For simplicity all batches have the same size.
$N$ indicates batch size, and $N_v$ and $N_t$ indicate the sizes of the
validation and training sets.

As model we use least squares classification [5,15]. This is ordinary linear
least squares regression on the classification labels $\{-1, +1\}$ with
intercept. For MNIST one-versus-all is used to train a multi-class model. In
case there are fewer samples for training than dimensions, the required inverse
of the covariance matrix is ill-defined and we resort to the Moore-Penrose
pseudo-inverse.

Monotonicity is calculated by the fraction of non-monotone iterations per
run. AULC is also calculated per run. We do 100 runs with different batches
and average to reduce variation from the randomness in the batches. Each run
uses a newly sampled test set consisting of 10000 samples. The test set is used
to estimate the true error rate and is not accessible to any of the algorithms.

We evaluate MT_SIMPLE, MT_HT and MT_CV and several baselines. The standard
learner just trains on all received data. A second baseline, $\lambda_S$, splits
the data into training and validation sets like MT_SIMPLE and uses the
validation data to select the optimal $L_2$ regularization parameter $\lambda$
for the least squares classifier. Regularization is implemented by adding
$\lambda I$ to the estimate of the covariance matrix.

In the first experiment we investigate the influence of $N_v$ and $\alpha$ for
MT_SIMPLE and MT_HT on the decisions. A complicating factor is that if $N_v$
changes, not only the decisions change, but also the training set sizes,
because $S_v$ is appended to the training set (see line 7 of Algorithm 1). This
makes interpretation of the results difficult because decisions are then made
in a different context. Therefore, for the first set of experiments, we do not
add $S_v$ to the training sets, also not for the standard learner. For this set
of experiments we use $N_t = 4$, $n = 150$, $d = 200$ for the peaking dataset,
and we vary $\alpha$ and $N_v$.

For the benchmark, we set $N_t = 10$, $N_v = 40$, $n = 150$ for peaking and
dipping, and we set $N_t = 5$, $N_v = 20$, $n = 40$ for MNIST. We fix $\alpha = 0.05$
and use $d = 500$ for the peaking dataset. For MNIST, as a preprocessing step we
extract 500 random Fourier features, as also done by Belkin et al. [1]. For MT_CV
we use $K = 5$ folds. For $\lambda_S$ we try $\lambda \in \{10^{-5}, 10^{-4.5}, \dots, 10^{4.5}, 10^{5}\}$ for peaking
and dipping, and we try $\lambda \in \{10^{-3}, 10^{-2}, \dots, 10^{3}\}$ for MNIST.


Results. We perform a preliminary investigation of the algorithms MT_SIMPLE
and MT_HT and the influence of the parameters $N_v$ and $\alpha$. We show several
learning curves in Fig. 1a and d. For small $N_v$ and $\alpha$ we observe that
MT_HT gets stuck: it does not switch models anymore, indicating that
consistency could be violated.

In Fig. 1b and e we give a more complete picture of all tried hyperparameters
in terms of the AULC. In Fig. 1c and f we plot the fraction of non-monotone
decisions during a run (note that the legends for the subfigures are different).
Observe that the axes are scaled differently (some are logarithmic). In some cases
zero non-monotone decisions were observed, resulting in a missing value due to
log(0). This occurs, for example, if MT_HT always sticks to the same model; then
no non-monotone decisions are made. The results of the benchmark are shown
in Fig. 2. The AULC and fraction of non-monotone decisions are given in Table 1.


6 Discussion 


First Experiment: Tuning $\alpha$ and $N_v$. As predicted, MT_SIMPLE typically
performs worse than MT_HT in terms of AULC and monotonicity unless $N_v$ is
very large. The variance in the estimate of the error rates on $S^i_v$ is so large
that in most cases the algorithm doesn't switch to the correct model. However,
MT_SIMPLE seems to be consistently better than the standard learner in terms
of monotonicity and AULC, while MT_HT can perform worse if badly tuned.
Fig. 1. Influence of $N_v$ and $\alpha$ for MT_SIMPLE and MT_HT on the Peaking and
Dipping dataset. Note that some axes are logarithmic and b, c, e, f have the
same legend.

Larger $N_v$ typically leads to improved AULC for both. $\alpha \in [0.05, 0.1]$
seems to work best in terms of AULC for most values of $N_v$. If $\alpha$ is too
small, MT_HT can get stuck; if $\alpha$ is too large, it switches models too often
and non-monotone behaviour occurs. If $\alpha \to \frac{1}{2}$, MT_HT becomes
increasingly similar to MT_SIMPLE, as predicted by the theory.


The fraction of non-monotone decisions of MT_HT is much lower than $\alpha$.
This is in agreement with Theorem 1, but could indicate in addition that the
hypothesis test is rather pessimistic. The standard learner and MT_SIMPLE often
make non-monotone decisions. In some cases almost 50% of the decisions are
non-monotone.
Table 1. Results of the benchmark. SL is the Standard Learner. AULC is the Area
Under the Learning Curve of the error rate. Fraction indicates the average fraction
of non-monotone decisions during a single run. Standard deviation shown in (braces).
Best monotonicity result is underlined.

        | Peaking                     | Dipping                   | MNIST
        | AULC          | Fraction    | AULC        | Fraction    | AULC        | Fraction
SL      | 0.198 (0.003) | 0.31 (0.02) | 0.49 (0.01) | 0.50 (0.03) | 0.44 (0.01) | 0.27 (0.04)
MT_S    | 0.195 (0.005) | 0.23 (0.03) | 0.45 (0.06) | 0.37 (0.15) | 0.42 (0.02) | 0.41 (0.04)
MT_HT   | 0.208 (0.009) | 0.00 (0.00) | 0.38 (0.08) | 0.00 (0.00) | 0.45 (0.02) | 0.00 (0.00)
MT_CV   | 0.208 (0.005) | 0.34 (0.03) | 0.28 (0.02) | 0.19 (0.08) | 0.45 (0.01) | 0.30 (0.06)
λ_S     | 0.147 (0.003) | 0.43 (0.03) | 0.49 (0.01) | 0.50 (0.03) | 0.36 (0.02) | 0.46 (0.05)


Second Experiment: Benchmark on Peaking, Dipping, MNIST. Interestingly,
for the peaking and MNIST datasets any non-monotonicity (double descent
[1]) in the expected learning curve almost completely disappears for $\lambda_S$,
which tunes the regularization parameter using validation data (Fig. 2). We
wonder if regularization can also help reduce the severity of double descent in
other settings. For the dipping dataset, regularization doesn't help, showing
that it cannot always prevent non-monotone behaviour. Furthermore, the fraction
of non-monotone decisions per run is largest for this learner (Table 1).

For the dipping dataset MT_CV has a large advantage in terms of AULC. We
hypothesize that this is largely due to tie breaking and small training set sizes due
to the 5 folds. Surprisingly, on the peaking dataset it seems to learn quite slowly.
The expected learning curves of MT_HT look better than those of MT_SIMPLE;
however, in terms of AULC the difference is quite small.

The fraction of non-monotone decisions for MT_HT per run is very small, as
guaranteed. However, it is interesting to note that this does not always translate
to monotonicity in the expected learning curve. For example, for peaking and
dipping the expected curve doesn't seem entirely monotone. But MT_CV, which
makes many non-monotone decisions per run, still seems to have a monotone
expected learning curve. While monotonicity of each individual learning curve
guarantees monotonicity of the expected curve, this result indicates that
monotonicity of each individual curve may not be necessary. This raises the
question: under what conditions do we have monotonicity of the expected
learning curve?


General Remarks. The fraction of non-monotone decisions of MT_HT being
so much smaller than $\alpha$ could indicate that the hypothesis test is too
pessimistic. Fagerland et al. [4] note that the asymptotic McNemar test can have
more power, which could further improve the AULC. For this test the guarantee
$P(p < \alpha \mid H_0) \le \alpha$ can be violated, but in light of the
monotonicity results obtained, this may not be an issue in practice.

MT_HT is inconsistent at times, but this does not have to be problematic. If
one knows the desired error rate, a minimum $N_v$ can be determined that ensures
the hypothesis test will not get stuck before reaching that error rate. Another
possibility is to make the size $N_v$ dependent on $i$: if $N_v$ is monotonically
increasing, this directly leads to consistency of MT_HT. It would be ideal if
$N_v$ could somehow be automatically tuned to trade off sample size
requirements, consistency and monotonicity. Since for CV $N_v$ automatically
grows and thus also directly implies consistency, a combination of MT_HT and
MT_CV is another option.

Devroye et al. [2] conjectured that it is impossible to construct a consistent 
learner that is monotone in terms of the expected learning curve. Since we look 
at individual curves, our work does not disprove this conjecture, but some of 
the authors of this paper believe that the conjecture can be disproved. One step
towards this is a substantially better understanding of the relation between
individual learning curves and the expected one.

Currently, our definition judges any decision that increases the error rate, by
however small an amount, as non-monotone. It would be desirable to have a broader
definition of non-monotonicity that allows for small and negligible increases of 
the error rate. Using a hypothesis test satisfying such a less strict condition could 
allow us to use less data for validation. 

Finally, the user of the learning system should be notified that non- 
monotonicity has occurred. Then the cause can be investigated and mitigated 
by regularization, model selection, etc. However, in automated systems our
algorithm can prevent any known and unknown causes of non-monotonicity (as long
as data is i.i.d.), and thus can be used as a failsafe that requires no human
intervention. 


7 Conclusion 


We have introduced three algorithms to make learners more monotone. We 
proved under which conditions the algorithms are consistent and we have shown 
for MT_HT that the learning curve is monotone with high probability. If one
cares only about monotonicity of the expected learning curve, MT_SIMPLE with
very large N_v or MT_CV may prove sufficient as shown by our experiments. If
N_v is small, or one desires that individual learning curves are monotone with
high probability (as practically most relevant), MT_HT is the right choice. Our
algorithms are a first step towards developing learners that, given more data, 
improve their performance in expectation. 

Abstract. In this paper, we describe the combination of machine learn-
ing and simulation towards a hybrid modelling approach. Such a com- 
bination of data-based and knowledge-based modelling is motivated by 
applications that are partly based on causal relationships, while other 
effects result from hidden dependencies that are represented in huge 
amounts of data. Our aim is to bridge the knowledge gap between the 
two individual communities from machine learning and simulation to 
promote the development of hybrid systems. We present a conceptual 
framework that helps to identify potential combined approaches and 
employ it to give a structured overview of different types of combinations 
using exemplary approaches of simulation-assisted machine learning and 
machine-learning assisted simulation. We also discuss an advanced pair- 
ing in the context of Industry 4.0 where we see particular further poten- 
tial for hybrid systems. 


Keywords: Machine learning · Simulation · Hybrid approaches


1 Introduction 


Machine learning and simulation have a similar goal: to predict the behaviour
of a system with data analysis and mathematical modelling. On the one side, 
machine learning has shown great successes in fields like image classification [21], 
language processing [24], or socio-economic analysis [7], where causal relation- 
ships are often only sparsely given but huge amounts of data are available. On the 
other side, simulation is traditionally rooted in natural sciences and engineering, 
e.g. in computational fluid dynamics [35], where the derivation of causal rela- 
tionships plays an important role, or in structural mechanics for the performance 
evaluation of structures regarding reactions, stresses, and displacements [6]. 
However, some applications can benefit from combining machine learning 
and simulation. Such a hybrid approach can be useful when the processing
capabilities of classical simulation computations cannot handle the available
dimensionality of the data, for example in earth system sciences [30], or when 
the behaviour of a system that is supposed to be predicted is based on both 
known, causal relationships and unknown, hidden dependencies, for example 
in risk management [25]. However, such challenges are in practice often still 
approached distinctly with either machine learning or simulation, apparently 
because they historically originate from distinct fields. This raises the question 
how these two modelling approaches can be combined into a hybrid approach 
in order to foster intelligent data analysis. Here, a key challenge in developing 
a hybrid modelling approach is to bridge the knowledge gap between the two 
individual communities, which are mostly either experts for machine learning or 
experts for simulation. Both groups have extremely deep knowledge about the 
methods used in their particular fields. However, the terminologies they use
differ, which can impede the exchange of ideas between the two communities.

Related work that describes a combination of machine learning with simulation
can roughly be divided into two groups, not surprisingly, either from a machine
learning or a simulation point of view. The first group frequently describes the 
integration of simulation into machine learning as an additional source for train- 
ing data, for example in autonomous driving [23], thermodynamics [19], or bio- 
medicine [13]. A typical motivation is the augmentation of data for scenarios 
that are not sufficiently represented in the available data. The second group of 
related works describes the integration of machine learning techniques in sim- 
ulation, often for a specific application, such as car crash simulation [6], fluid 
simulation [38], or molecular simulation [26]. A typical motivation is to iden- 
tify surrogate models [16], which offer an approximate but cheaper to evaluate 
model to replace the full simulation. Another technique that is used to adapt 
a dynamical simulation model to new measurements is data assimilation, which 
is traditionally used in weather forecasting [22]. Related work that considers an 
equal combination of machine learning and simulation is quite rare. A work that 
is closest to describing such a hybrid, symbiotic modelling approach is [4]. 

More generally, the integration of prior knowledge into machine learning can be
described as informed machine learning [34] or theory-guided data science [18]. 
The paper [34] presents a survey with a taxonomy that structures approaches 
according to the knowledge type, representation, and integration stage. We reuse 
those categories in this paper. However, that survey considers a much broader 
spectrum of knowledge representations, from logic rules over simulation results 
to human interaction, while this paper puts an explicit focus on simulations. 

Our goal is to make the key components of the two modelling approaches 
machine learning and simulation transparent and to show the versatile, potential 
combination possibilities in order to inspire and foster future developments of 
hybrid systems. We do not intend to go into technical details but rather give a 
high-level methodological overview. With our paper we want to outline a vision 
of a stronger, more automated interplay between data- and simulation-based 
analysis methods. We mainly aim our findings at the data analysis and machine
learning community, but readers from the simulation community are welcome too.
Generally, our target audience is researchers and users of one of the
two modelling approaches who want to learn how they can use the other one.

Fig. 1. Subfields of Combining Machine Learning and Simulation. The fields
of machine learning and simulation have an intersecting area, which we partition into
three subfields: 1. Simulation-assisted machine learning describes the integration of
simulations into machine learning. 2. Machine-learning assisted simulation describes
the integration of machine learning into simulation. 3. A hybrid combination describes
a combination of machine learning and simulation with a strong mutual interplay.

The contributions of this paper are: 1. A conceptual framework serving as an 
orientation aid for comparing and combining machine learning and simulation, 
2. a structured overview of combinations of both modelling approaches, 3. our
vision of a hybrid approach with a stronger interplay of data- and
simulation-based analysis.

The paper is structured as follows: In Sect. 2 we give a brief overview of the 
subfields that result from combining machine learning and simulation. In Sect. 3 
we present these two separate modelling approaches along our conceptual frame- 
work. In Sect. 4 we describe the versatile combinations by giving exemplary refer- 
ences and applications. In Sect. 5 we further discuss our observations in Industry 
4.0 projects that lead us to a vision for the advanced pairing of machine learning 
and simulation. Finally we conclude in Sect. 6. 


2 Overview 


In this section, we give a short overview of the subfields that result from a
combination of machine learning with simulation. We view the combination with 
equal focus on both fields, driving our vision of a hybrid modelling approach 
with a stronger and automated interplay. Figure 1 illustrates our view on the 
fields’ overlap, which can be partitioned into the three subfields simulation- 
assisted machine learning, machine-learning assisted simulation, and a hybrid 
combination. While the first two can be regarded as one-sided approaches
because they describe the integration from the point of view of one approach,
the last one can be regarded as a two-sided approach. Although the term hybrid
is often used in the literature for the above one-sided approaches, we prefer to
use it only for the two-sided approach where machine learning and simulation
have a strong mutual, symbiotic-like interplay.

[Figure 2 labels: Machine Learning. 1. Model Generation Phase: Learning an
Inductive Model (Training Data, Hypothesis Set, Final Hypothesis). 2. Model
Application Phase: Inference / Prediction.]

Fig. 2. Components of Machine Learning. Machine Learning consists of two
phases 1. model generation, and 2. model application, where the focus is usually
on the first phase, in which an inductive model is learned from data. The
components of this phase are the training data, a hypothesis set, a learning
algorithm, and a final hypothesis [1,34]. It describes the finding of patterns in
an initially large data space, which are finally represented in a condensed form
by the final hypothesis. This is illustrated by the reversed triangle and can be
described as a “bottom-up approach”.


3 Modelling Approaches 


In this section, we describe the two modelling approaches by means of a concep- 
tual framework that aims to make them and their components transparent and 
comparable. 


3.1 Machine Learning 


The main goal of machine learning is that a machine automatically learns a 
model that describes patterns in given data. The typical components of machine 
learning are illustrated in Fig. 2. In the first, main phase an inductive model is 
learned. Inductive means that the model is built by drawing conclusions from 
samples and is thus not guaranteed to depict causal relationships, but can instead 
identify hidden, previously unknown patterns, meaning that the model is usually 
not knowledge-based but rather data-based. This inductive model can finally be 
applied to new data in order to predict or infer a desired target variable. 

[Figure 3 labels: Simulation. 1. Model Generation Phase: Identifying a
Deductive Model. 2. Model Application Phase: Running a Simulation (Model,
Input Parameters, Numerical Method, Simulation Result).]

Fig. 3. Components of Simulation. Simulation comprises the two phases 1. model
generation, and 2. model application, where the focus often is on the second phase, in
which an earlier identified deductive model is used in order to create simulation results.
The components of this phase are the simulation model, input parameters, a numerical
method, and the simulation result. It describes the unfolding of local interactions from
a compactly represented initial model into an expanded data space. This is
illustrated by the triangle and can be described as a “top-down approach”.

The model generation phase can be roughly split into four sub-phases or
respective components [1,34]. Firstly, training data is prepared that depicts
historical records of the investigated process or system. Secondly, a hypothesis set
is defined in the form of a function class or network architecture that is assumed
to map input features to the target variables. Thirdly, a learning algorithm
tunes the parameters of the hypothesis set so that the performance of the map- 
ping is maximized by using optimization algorithms like gradient descent and 
results in, fourthly, the final hypothesis, which is the desired inductive model. 
This model generation phase is often repeated in a loop-like manner by tuning 
hyper-parameters until a sufficient model performance is achieved. 
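
The four components of the model generation phase can be sketched in a few
lines. The example below is only an illustration under assumptions chosen
here: the data-generating process (a noisy line), the linear hypothesis
set, plain gradient descent as the learning algorithm, the learning-rate grid,
and all function names are hypothetical and not taken from the referenced works.

```python
import random


def make_training_data(n, seed=0):
    """Training data: noisy records of a hypothetical process y = 2x + 0.5."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 0.5 + rng.gauss(0.0, 0.01) for x in xs]
    return xs, ys


def hypothesis(params, x):
    """Hypothesis set: all linear functions x -> w * x + b."""
    w, b = params
    return w * x + b


def mean_squared_error(params, xs, ys):
    return sum((hypothesis(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate, steps):
    """Learning algorithm: tunes (w, b) to minimize the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def train_with_hyperparameter_loop(xs, ys, target_mse=1e-3):
    """Repeat model generation with different hyper-parameters until the
    performance is sufficient; returns the final hypothesis and its setting."""
    params, rate = None, None
    for rate in (0.001, 0.01, 0.1):
        params = gradient_descent(xs, ys, rate, steps=500)
        if mean_squared_error(params, xs, ys) <= target_mse:
            break
    return params, rate


xs, ys = make_training_data(200)
final_hypothesis, learning_rate = train_with_hyperparameter_loop(xs, ys)
```

Applying `hypothesis(final_hypothesis, x)` to a new input `x` then corresponds
to the model application phase.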


3.2 Simulation 


The goal of a simulation is to predict the behaviour of a system or process 
for a particular situation. There are different types of simulations, ranging 
from cellular automata, over agent-based simulations, to equation-based sim- 
ulations [9,15,36]. In the following we concentrate on the last type, which is 
based on mathematical models and is especially used in science and engineer- 
ing. The first, required stage preceding the actual simulation is the identification 
of a deductive model, often in the form of differential equations. Deductive in 
this context means that the model describes causal relationships and can thus 
be called knowledge-based. Such models are often developed through extensive 
research, starting with a derivation, for example in theoretical physics, and
continuing with plentiful experimental validations. Some recent research presents
proofs of concept for identifying models directly from data [8,33].

[Figure 4 labels: Simulation-Assisted Machine Learning. Integration of
Simulation Results in: (a) Training Data, (b) Hypothesis Set, (c) Learning
Algorithm, (d) Final Hypothesis.]

Fig. 4. Types of Simulation-Assisted Machine Learning. Simulations, in
particular the simulation results, can be generally integrated into the four
different components of machine learning. The triangles illustrate the machine
learning (blue/dark gray) or the simulation (orange/light gray) approach and
their components, which are themselves presented in Figs. 2 and 3. The
simulation results can be used to (a) augment the training data, (b) define
parts of the hypothesis set in the form of empirical functions, (c) steer the
training algorithm in generative adversarial networks, or (d) verify the final
hypothesis against scientific consistency. (Color figure online)

The main phase of a simulation is the application of the identified model
for a specific scenario, often called running a simulation. This phase can be
described in four typical main components or sub-phases, which are, as
illustrated in Fig. 3, the mathematical model, the input parameters, the numerical
method, and finally the simulation result [36]. After the selection of a
mathematical model, the input parameters that describe the specific scenario are
defined in the second sub-phase. They can comprise general parameters such 
as the spatial domain or time of interest, as well as initial conditions quanti- 
fying the systems’ or processes’ initial status and boundary conditions defining 
the behaviour at domain borders. In the third sub-phase, a numerical method 
computes the solution of the given model observing the constraints resulting 
from the input parameters. Examples for numerical methods are finite differ- 
ences, finite elements or finite volume methods for spatial discretization [36], or 
particle methods based on interaction forces [26]. These form the basis for an 
approximate solution, which is the final simulation result. This model application 
phase is often repeated in a loop-like manner, e.g., by tuning the discretization 
to achieve a desired approximation accuracy and stability of the solution. 
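
The four components of the model application phase, including the loop that
refines the discretization, can be sketched as follows. This is a toy
illustration under assumptions chosen here: the mathematical model is the decay
equation dy/dt = -k y, the numerical method is explicit Euler time stepping
(chosen for brevity, not one of the spatial discretization methods named
above), and all names and tolerances are hypothetical.

```python
import math


def model_rhs(y, k):
    """Mathematical model: right-hand side of dy/dt = -k * y."""
    return -k * y


def run_simulation(k, y0, t_end, n_steps):
    """Numerical method: explicit Euler with n_steps uniform time steps.

    k, y0 and t_end are the input parameters (model constant, initial
    condition, time of interest). Returns the approximate y(t_end).
    """
    dt = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        y += dt * model_rhs(y, k)
    return y


def refine_discretization(k, y0, t_end, tol=1e-4, n_steps=10, max_doublings=20):
    """Repeat the simulation with a finer time step until two successive
    results differ by less than tol."""
    previous = run_simulation(k, y0, t_end, n_steps)
    for _ in range(max_doublings):
        n_steps *= 2
        current = run_simulation(k, y0, t_end, n_steps)
        if abs(current - previous) < tol:
            return current, n_steps
        previous = current
    raise RuntimeError("discretization did not converge")


# Simulation result for k = 1, y(0) = 1 at t = 1; the exact value is exp(-1).
result, n_steps = refine_discretization(k=1.0, y0=1.0, t_end=1.0)
```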


4 Combining Machine Learning and Simulation 


In this section, we describe combinations of machine learning and simulation 
by using our conceptual framework from Sect. 3. Here, we focus on simulation- 
assisted machine learning and machine-learning assisted simulation. For each of 
the methodical combination types, we give exemplary application references. 


4.1 Simulation-Assisted Machine Learning 


Simulation offers an additional source of information for machine learning that 
goes beyond typically available data and that is rich in knowledge. This
additional information can be integrated into the four components of machine
learning as illustrated in Fig. 4. In the following, we give an overview of these
integration types with one illustrative example each, and refer to [34] for a
more detailed discussion.

Simulations are particularly useful for creating additional training data in 
a controlled environment. This is for example applied in autonomous driving, 
where simulations such as physics engines are employed to create photo-realistic 
traffic scenes, which can be used as synthetic training data for learning tasks like 
semantic segmentation [14], or for adversarial test generation [40]. As another 
example, in systems biology, simulations can be integrated in the training data 
of kernelized machine learning methods [13]. 
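
As a minimal sketch of this integration type, the snippet below adds simulated
samples for parameter regions that the measured data does not cover. The
drag-free projectile formula is only a hypothetical stand-in for a real
simulation (it is not a physics engine rendering traffic scenes), and the
function names and data layout are assumptions made here.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_range(speed, angle_deg):
    """Hypothetical stand-in simulation: range of a drag-free projectile."""
    return speed ** 2 * math.sin(2.0 * math.radians(angle_deg)) / G


def augment_with_simulation(measured, speeds, angles):
    """Append synthetic (input, target) pairs for scenarios that are
    underrepresented in the measured data."""
    synthetic = [((v, a), simulate_range(v, a)) for v in speeds for a in angles]
    return measured + synthetic


# One real measurement, plus a simulated grid of uncovered scenarios.
measured = [((10.0, 45.0), 10.1)]
training_data = augment_with_simulation(measured, [5.0, 20.0], [30.0, 60.0])
```

The combined list can then be passed to any learning algorithm; whether
synthetic and measured samples should be weighted equally depends on how well
the simulation matches reality.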

Moreover, simulations can be integrated into the hypothesis set, either 
directly as solvers or through deduced, empirical functions that compactly
describe the simulation results. These functions can be built into the
architecture of a neural network, as shown for the application of finding an optimal
design strategy for a warm forming process [20]. 

The integration of simulations into the learning algorithm can for example 
be realized by generative adversarial networks (GANs), which learn a prediction 
function that obeys constraints, which might be unknown but are implicitly 
given through a simulation [31]. 

Another important integration type is in the validation of the final hypothesis 
by simulations. An example of this comes from material discovery, where first
a machine learning model suggests new compounds based on patterns in a
database, and second the physical properties are computed and thus checked by a
density functional theory simulation [17]. 

An approach that uses simulations along the whole machine learning pipeline 
is reinforcement learning (RL), when the model is learned in a simulated envi- 
ronment [2]. Studies under the keyword “sim-to-real” are often concerned with 
robots learning to grip or move unknown objects in simulations and usually 
require retraining in reality. An application for controlling the temperature of 
plasma follows the analogous approach, i.e., a training based on a software- 
physics model, where the learned RL model is then further adapted for use in 
reality [41]. 


4.2 Machine-Learning Assisted Simulation 


Machine learning is often used in simulation with the intention to support the 
solution process or to detect patterns in the simulation data. With respect to our 
conceptual framework presented in Sect. 3, machine learning techniques can be 
used for the initial model, the input parameters, the numerical method, and the
final simulation results, as illustrated in Fig. 5. In the following we give an
overview of the integration types. Again, we do not intend to cover the full
spectrum of machine-learning assisted simulation; rather, we want to illustrate
its diverse approaches through representative examples.

[Figure 5 labels: Machine-Learning Assisted Simulation. Integration of
Machine Learning Model in: (a) Model, (b) Input Parameters, (c) Numerical
Method, (d) Simulation Results.]

Fig. 5. Types of Machine-Learning Assisted Simulation. Machine learning
techniques, in particular the final hypothesis, can be used in different
simulation components. The triangles illustrate the machine learning (blue/dark
gray) or the simulation (orange/light gray) approach and their components,
which are themselves explained in Figs. 2 and 3. Exemplary use cases for
machine learning models in simulation are (a) model order reduction and the
development of surrogate models that offer approximate but simpler solutions,
(b) the automated inference of an intelligent choice of input parameters for a
next simulation run, (c) a partly trainable solver for differential equations,
or (d) the identification of patterns in simulation results for scientific
discovery. (Color figure online)

A prominent integration type of machine learning techniques into simulation
is the identification of simpler models, such as surrogate models [11,12,16,26].
These are approximate and cheap to evaluate models that are particularly of
interest when the solution of the original, more precise model is very time- or 
resource-consuming. The surrogate model can then be used to analyse the over- 
all behaviour of the system in order to reveal scenarios that should be further 
investigated with the detailed original simulation model. Such surrogate models 
can be developed with machine-learning techniques either with data from real- 
world experiments, or with data from high-fidelity simulations. One application 
example is the optimization of process parameters using deep neural networks 
as surrogate models [27]. Kernel-based approaches are also commonly used as
surrogate models for simulations; an example that improves the energy efficiency
of a gas transport network is shown in [10]. A well-established approach for
surrogate modelling is model order reduction, for example with proper orthogonal
decomposition, which is closely related to principal component analysis [5,37]. 
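
The idea of a surrogate model can be sketched as follows: a costly simulation
is run at a few design points, and a cheap model fitted to those results
answers all further queries. The sketch uses a Nadaraya-Watson kernel smoother
as a deliberately simple stand-in for the kernel-based and neural surrogates
cited above; the "simulation" (a sine function), the design grid, the
bandwidth, and all names are assumptions made here.

```python
import math


def high_fidelity_simulation(x):
    """Hypothetical stand-in for an expensive simulation run."""
    return math.sin(x)


def fit_kernel_surrogate(design_points, bandwidth):
    """Run the simulation once per design point and return a cheap predictor
    (Gaussian-weighted average of the stored results)."""
    results = [high_fidelity_simulation(x) for x in design_points]

    def surrogate(x):
        weights = [
            math.exp(-((x - p) ** 2) / (2.0 * bandwidth ** 2))
            for p in design_points
        ]
        # Only meaningful near the design points; far away all weights vanish.
        return sum(w * r for w, r in zip(weights, results)) / sum(weights)

    return surrogate


# 21 high-fidelity runs on [0, pi]; afterwards only the surrogate is evaluated.
design = [i * math.pi / 20 for i in range(21)]
surrogate = fit_kernel_surrogate(design, bandwidth=0.1)
approximation = surrogate(1.0)
```

The surrogate trades accuracy for speed: it is approximate between design
points, which is why regions of interest are typically re-examined with the
original model.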

Data assimilation, which includes the calibration of constitutive models and 
the estimation of system states, is another area where machine learning tech- 
niques enhance simulations. Data assimilation problems can be modelled using 
dynamic Bayesian networks with continuous physically interpretable state spaces 
where the evaluation of transition kernels and observation operators requires 
forward-simulation runs [29]. 

Machine learning techniques can also be used to study the parameter depen- 
dence of simulation results. For example, after an engineer executes a sequence of 
simulations, a machine learning model can detect different behavioral modes in 
the results and thus reduce the analysis effort during the engineering process [6]. 
This supports the selection of the parameter setting for the next simulation, for 
which active learning techniques can also be employed. For example, [39] studied 
it for selecting the molecules for which the internal energy shall be determined
by computationally expensive quantum-mechanical calculations, as well as for 
determining a surrogate model for the fluid flow in a well-bore while drilling. 
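
A minimal sketch of such a selection loop is given below. It uses the distance
to the nearest already simulated parameter as a crude proxy for model
uncertainty; this acquisition rule, the one-dimensional parameter space, and
the function names are assumptions made here for illustration and are not the
method of [39].

```python
def select_next_parameter(simulated, candidates):
    """Pick the candidate that is farthest from every simulated parameter."""
    return max(candidates, key=lambda c: min(abs(c - s) for s in simulated))


def active_parameter_study(simulate, candidates, initial, budget):
    """Run at most `budget` further simulations, each at the parameter
    selected by the acquisition rule above."""
    results = {p: simulate(p) for p in initial}
    for _ in range(budget):
        remaining = [c for c in candidates if c not in results]
        if not remaining:
            break
        next_parameter = select_next_parameter(list(results), remaining)
        results[next_parameter] = simulate(next_parameter)
    return results


# Usage with a hypothetical simulation p -> p ** 2 and one extra run.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
study = active_parameter_study(lambda p: p ** 2, grid, [0.0, 1.0], budget=1)
```

In practice the distance proxy would be replaced by an uncertainty estimate of
the learned model, for example the disagreement within an ensemble.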

The integration of machine learning techniques into the numerical method
can help to obtain the numerical solution. One approach is to exchange parts
of the model that are resource-consuming to solve with learned models that can
be computed faster, for example with machine-learning-generated force fields in
molecular dynamics simulations [26]. Another approach that has recently been
investigated is trainable solvers for partial differential equations that
determine the complete solution through a neural network [28].

A further, very important integration type is the application of machine
learning techniques to the simulation results in order to detect patterns, often
motivated by the goal of scientific discovery. While there are plenty of
application domains, two exemplary representatives are particle physics [3] and
earth sciences, for example with the use of convolutional neural networks for the
detection of weather patterns on climate simulation data [30]. For further examples we
refer to a survey about explainable machine learning for scientific discovery [32]. 


5 Advanced Pairing of Machine Learning and Simulation 


Section 4 gave a brief overview of the versatile existing approaches that integrate 
aspects of machine learning into simulation and vice versa, or that combine sim- 
ulation and machine learning sequentially. Yet, we think that the integration of 
these two established worlds is only at the beginning, both in terms of modelling 
approaches and in terms of available software solutions. 

In the following, we describe a number of observations from our project expe- 
rience in the development of cyber-physical systems for Industry 4.0 applications 
that support this assessment. Note that the key technical goal of Industry 4.0 is 
the flexibilization of production processes. In addition to the broad integration of 
digital equipment in the production machinery, a key enabler of flexibilization
is a decrease of process design and dimensioning times and, ideally, a merging
of the planning and production phases, which are today still strictly separated. This
requires a new generation of computer-aided engineering (CAE) software sys- 
tems that allow for very fast process optimization cycles with real time feedback 
loops to the production machinery. An advanced pairing of machine learning 
and simulation will be key to realize such systems by addressing the following 
issues: 


— Simulation results are not fully exploited: Especially in the indus- 
trial practice, simulations are run with a very specific analysis goal based 
on expert-designed quantities of interest. This ignores that the simulation 
result might reveal more patterns and regularities, which might be irrelevant 
for the current analysis goal but useful in other contexts. 

— Selective surrogate modelling: Even if modern machine learning 
approaches are used, surrogate models are built for very specific purposes 
and the decision when and where to use a surrogate model is left to domain 
experts. This makes too little use of the fact that similar underlying systems
might lead to similar surrogate models; in consequence, too many costly
high-fidelity simulations are run to generate the data basis, although parts of
the learned surrogate models could be transferred.

— Parameter studies and simulation engines: Parameter and design stud- 
ies are well-established tools in many fields of engineering. Surprisingly, the 
frameworks to conduct these studies and to build the surrogate models are 
third-party solutions that are separated from the core simulation engines. For 
the parameter study framework, the simulation engine is a black box, which 
does not know that it is currently used for a parameter study. In turn, the
standard rules to generate sampling points in the parameter space are not
aware of the internals of the simulation engine. This raises the question of
how much more efficiently parameter studies could be conducted if both
software systems were more strongly connected to each other.


These observations lead us to a research concept that we propose in this
paper and call learning simulation engines. A learning simulation engine
is a hybrid system that combines machine learning and simulation in an opti- 
mal way. Such an engine can automatically decide when and where to apply 
learned surrogate models or high-fidelity simulations. Surrogate models are effi- 
ciently organized and re-used through the use of transfer learning. Parameter and 
design optimization is an integral component of the learning simulation engine 
and active learning methods allow the efficient re-use of costly high-fidelity com- 
putations. 
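
One possible reading of the "decide when and where" aspect is sketched below.
The text proposes a research concept, not a design, so everything in this
sketch is an assumption made for illustration: the class name, the
one-dimensional parameter, the trust radius as decision rule, and the
nearest-neighbour lookup standing in for a learned surrogate.

```python
class LearningSimulationEngineSketch:
    """Answers queries from stored high-fidelity results when a close enough
    one exists, and runs a new high-fidelity simulation otherwise."""

    def __init__(self, high_fidelity, trust_radius):
        self.high_fidelity = high_fidelity  # callable: parameter -> result
        self.trust_radius = trust_radius    # hypothetical decision threshold
        self.archive = {}                   # parameter -> stored result
        self.high_fidelity_calls = 0

    def query(self, parameter):
        if self.archive:
            nearest = min(self.archive, key=lambda q: abs(q - parameter))
            if abs(nearest - parameter) <= self.trust_radius:
                # Surrogate branch: re-use the costly computation.
                return self.archive[nearest], "surrogate"
        # High-fidelity branch: run the simulation and remember the result.
        result = self.high_fidelity(parameter)
        self.high_fidelity_calls += 1
        self.archive[parameter] = result
        return result, "high-fidelity"


# Usage with a hypothetical simulation p -> p ** 2.
engine = LearningSimulationEngineSketch(lambda p: p ** 2, trust_radius=0.05)
first = engine.query(1.0)    # nothing stored yet, runs the simulation
second = engine.query(1.02)  # close to 1.0, answered from the archive
third = engine.query(2.0)    # far from all stored runs, runs the simulation
```

A real engine would replace the fixed radius by an error estimate of the
surrogate and share surrogates across related systems via transfer learning.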

Of course, the vision of a learning simulation engine raises numerous research 
questions. We describe some of them in view of Fig. 1. First of all, the question
is how learning and simulation can be technically combined into such an advanced
hybrid approach, in particular whether they can only be integrated into each other
by using the final simulation results and the final hypothesis (as shown in Figs. 4
and 5), or whether they can also be combined at an earlier sub-phase. Moreover, the
counterparts of the learning’s model generation phase and the simulation’s model
application phase (see Figs. 2 and 3) should be investigated further in order
to better understand the similarities and differences to the simulation’s model
generation phase and the learning’s model application phase.


6 Conclusion 


In this paper, we described the combination of machine learning and simulation 
motivated by fostering intelligent analysis of applications that can benefit from 
a combination of data- and knowledge-based solution approaches. 

We categorized the overlap between the two fields into three sub-fields, 
namely, simulation-assisted machine learning, machine-learning assisted simu- 
lation, and a hybrid approach with a strong and mutual interplay. We presented 
a conceptual framework for the two separate approaches, in order to make them 
and their components transparent for the development of a potential combined 
approach. In summary, it describes machine learning as a bottom-up approach 
that generates an inductive, data-based model and simulation as a top-down
approach that applies a deductive, knowledge-based model. Using this concep- 
tual framework as an orientation aid for their integration into each other, we 
gave a structured overview of the combination of machine learning and
simulation. We showed the versatility of the approaches through exemplary methods
and use cases, ranging from simulation-based data augmentation and scientific 
consistency checking of machine learning models, to surrogate modelling and 
pattern detection in simulations for scientific discovery. Finally, we described 
the scenario of an advanced pairing of machine learning and simulation in the 
context of Industry 4.0 where we see particular further potential for hybrid 
systems. 

Abstract. In multi-label classification (MLC), each instance is associated
with a set of class labels, in contrast to standard classification, where
an instance is assigned a single label. Binary relevance (BR) learning,
which reduces a multi-label problem to a set of binary classification problems,
one per label, is arguably the most straight-forward approach to MLC.
In spite of its simplicity, BR proved to be competitive with more sophisticated
MLC methods, and still achieves state-of-the-art performance for
many loss functions. Somewhat surprisingly, the optimal choice of the
base learner for tackling the binary classification problems has received
very little attention so far. Taking advantage of the label independence
assumption inherent to BR, we propose a label-wise base learner selection
method optimizing label-wise macro averaged performance measures. In
an extensive experimental evaluation, we find that our approach, called
LiBRe, can significantly improve generalization performance.


Keywords: Multi-label classification · Algorithm selection · Binary
relevance


1 Introduction 


By relaxing the assumption of mutual exclusiveness of classes, the setting of 
multi-label classification (MLC) generalizes standard (binary or multinomial) 
classification—subsequently also referred to as single-label classification (SLC). 
MLC has received a lot of attention in the recent machine learning literature [23, 
29]. The motivation for allowing an instance to be associated with several classes 
simultaneously originated in the field of text categorization [19], but nowadays 
multi-label methods are used in applications as diverse as image processing [4, 26] 
and video annotation [14], music classification [18], and bioinformatics [2]. 
Common approaches to MLC either adapt existing algorithms (algorithm 
adaptation) to the MLC setting, e.g., the structure and the training procedure 
for neural networks, or reduce the original MLC problem to one or multiple SLC 
problems (problem transformation). The most intuitive and straight-forward
problem transformation is to decompose the original task into several binary
classification tasks, one per label. More specifically, each task consists of train- 
ing a classifier that predicts whether or not a specific label is relevant for a query 
instance. This approach is called binary relevance (BR) learning [3]. Beyond BR, 
many more sophisticated strategies have been developed, most of them trying 
to exploit correlations and interdependencies between labels [28]. In fact, BR 
is often criticized for ignoring such dependencies, implicitly assuming that the 
relevance of one label is (statistically) independent of the relevance of another 
label. In spite of this, or perhaps just because of this simplification, BR proved to 
achieve state-of-the-art performance, especially for so-called decomposable loss 
functions, for which its optimality can even be corroborated theoretically [7,9]. 

Techniques for reducing MLC to SLC problems involve the choice of a 
base learner for solving the latter. Somewhat surprisingly, this choice is often 
neglected, despite having an important influence on generalization performance 
[10-12,15]. Even in more extensive studies [10,12], a base learner is fixed a 
priori in a more or less arbitrary way. Broader studies considering multiple 
base learners, such as [6,22], are relatively rare and rather limited in terms 
of the number of base learners considered. Only recently, greater attention to 
the choice of the base learner has been paid in the field of automated machine 
learning (AutoML) [17, 24, 25], where the base learner is considered an
important “hyper-parameter” to tune. Indeed, optimizing the selection of base
learners is laborious and computationally expensive in general, which could be
one reason why it has been approached with reservation; AutoML now offers new
possibilities in this direction.

Motivated by these opportunities, and building on recent AutoML methodol- 
ogy, we investigate the idea of base learner selection for BR in a more systematic 
way. Instead of only choosing a single base learner to be used for all labels simul- 
taneously, we even allow for selecting an individual learner for each label (i.e., 
each binary classification task) separately. In an extensive experimental study, 
we find that customizing BR in a label-wise manner can significantly improve 
generalization performance. 


2 Multi-label Classification


The setting of multi-label classification (MLC) allows an instance to belong to 
several classes simultaneously. Consequently, several class labels can be assigned 
to an instance at the same time. For example, a single image could be tagged 
with labels Sun and Beach and Sea and Yacht. 


2.1 Problem Setting 


To formalize this learning problem, let $\mathcal{X}$ denote an instance space and
$\mathcal{L} = \{\lambda_1, \ldots, \lambda_m\}$ a finite set of $m$ class labels. An instance
$x \in \mathcal{X}$ is then (non-deterministically) associated with a subset of class labels
$L \in 2^{\mathcal{L}}$. The subset $L$ is often called the set of relevant labels, while its
complement $\mathcal{L} \setminus L$ is considered irrelevant for $x$. Furthermore, a set $L$ of
relevant labels can be identified by a binary vector $y = (y_1, \ldots, y_m)$ where
$y_j = 1$ if $\lambda_j \in L$ and $y_j = 0$ otherwise (i.e., if $\lambda_j \in \mathcal{L} \setminus L$).
The set of all label combinations is denoted by $\mathcal{Y} = \{0,1\}^m$.
Generally speaking, a multi-label classifier $h$ is a mapping $h : \mathcal{X} \to \mathcal{Y}$
returning, for a given instance $x \in \mathcal{X}$, a prediction in the form of a vector

$$h(x) = (h_1(x), h_2(x), \ldots, h_m(x)).$$


The MLC task can be stated as follows: Given a finite set of observations as
training data $D_{train} = (X_{train}, Y_{train}) = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X}^N \times \mathcal{Y}^N$,
the goal is to learn a classifier $h : \mathcal{X} \to \mathcal{Y}$ that generalizes well beyond
these observations in the sense of minimizing the risk with respect to a specific
loss function.
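
The identification of a set of relevant labels with a binary vector can be illustrated with a few lines of Python. This is only a minimal sketch: the helper names `to_binary_vector` and `to_label_set`, as well as the fifth label `Mountain`, are illustrative assumptions and do not appear in the original work.

```python
from typing import Iterable, List, Sequence, Set


def to_binary_vector(relevant: Iterable[str], labels: Sequence[str]) -> List[int]:
    """Encode a set of relevant labels as a 0/1 vector over the ordered label set."""
    relevant = set(relevant)
    unknown = relevant - set(labels)
    if unknown:
        raise ValueError(f"labels not contained in the label set: {sorted(unknown)}")
    return [1 if lab in relevant else 0 for lab in labels]


def to_label_set(y: Sequence[int], labels: Sequence[str]) -> Set[str]:
    """Decode a 0/1 vector back into the set of relevant labels."""
    return {lab for lab, bit in zip(labels, y) if bit == 1}


# Hypothetical label set with m = 5 labels.
LABELS = ["Sun", "Beach", "Sea", "Yacht", "Mountain"]
y = to_binary_vector({"Sun", "Sea"}, LABELS)
print(y)                        # [1, 0, 1, 0, 0]
print(to_label_set(y, LABELS))  # the two relevant labels, Sun and Sea
```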


2.2 Loss Functions 


A wide spectrum of loss functions has been proposed for MLC, many of which are 
generalizations or adaptations of losses for single-label classification. In general, 
these loss functions can be divided into two major categories: instance-wise and 
label-wise. While the latter first compute a loss for each label and then aggregate 
the values obtained across the labels, e.g., by taking the mean, instance-wise loss 
functions first compute a loss for each instance and subsequently aggregate the 
losses over all instances in the test data. As an obvious advantage of label-wise 
loss functions, note that they can be optimized by optimizing a standard SLC loss 
for each label separately. In other words, label-wise losses naturally harmonize 
with label-wise decomposition techniques such as BR. Since this allows for a 
simpler selection of the base learner per label, we focus on two such loss functions 
in the following. For additional details on MLC and loss functions, especially 
instance-wise losses, we refer to [23,29]. 

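Let $D_{test} = (X_{test}, Y_{test}) = \{(x_i, y_i)\}_{i=1}^{S} \subset \mathcal{X}^S \times \mathcal{Y}^S$ be a test set of size $S$.
Further, let $H = (h(x_1), \ldots, h(x_S)) \in \mathcal{Y}^S$. Then, the Hamming loss, which can
be seen as a generalized form of the error rate, is defined¹ as

$$\mathcal{L}_H(Y_{test}, H) = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{S} \sum_{i=1}^{S} [y_{i,j} \neq h_j(x_i)] . \tag{1}$$

Moreover, the label-wise macro-averaged F-measure (which is actually a measure
of accuracy, not a loss function, and thus to be maximized) is given by

$$F(Y_{test}, H) = \frac{1}{m} \sum_{j=1}^{m} \frac{2 \sum_{i=1}^{S} y_{i,j} \, h_j(x_i)}{\sum_{i=1}^{S} y_{i,j} + \sum_{i=1}^{S} h_j(x_i)} . \tag{2}$$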


Obviously, to optimize the measures (1) and (2), it is sufficient to optimize each 
label individually, which corresponds to optimizing the inner term of the (first) 
sum. 
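
Both measures can be computed label by label, which the following sketch makes explicit. It is a plain-Python illustration, not the evaluation code used in the study; in particular, returning 1.0 for a label with neither relevant nor predicted instances (a 0/0 case) is an assumption, since the text does not specify a convention.

```python
from typing import Sequence

Matrix = Sequence[Sequence[int]]  # S rows (instances) x m columns (labels)


def hamming_loss(Y: Matrix, H: Matrix) -> float:
    """Fraction of (instance, label) pairs on which truth and prediction differ."""
    S, m = len(Y), len(Y[0])
    errors = sum(Y[i][j] != H[i][j] for i in range(S) for j in range(m))
    return errors / (S * m)


def label_f_measure(Y: Matrix, H: Matrix, j: int) -> float:
    """F-measure of label j: 2 * true positives / (relevant + predicted)."""
    two_tp = 2 * sum(Y[i][j] * H[i][j] for i in range(len(Y)))
    denom = sum(row[j] for row in Y) + sum(row[j] for row in H)
    # Assumption: a label that is never relevant and never predicted counts as perfect.
    return two_tp / denom if denom > 0 else 1.0


def macro_f_measure(Y: Matrix, H: Matrix) -> float:
    """Unweighted mean of the per-label F-measures."""
    m = len(Y[0])
    return sum(label_f_measure(Y, H, j) for j in range(m)) / m


# Toy example with S = 4 instances and m = 2 labels.
Y = [[1, 0], [1, 1], [0, 0], [0, 1]]
H = [[1, 0], [0, 1], [0, 0], [1, 1]]
print(hamming_loss(Y, H))     # 0.25  (2 wrong entries out of 8)
print(macro_f_measure(Y, H))  # 0.75  (label 1: 0.5, label 2: 1.0)
```

Because each summand depends on one label column only, improving the prediction for a single label can never worsen the contribution of another label.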


¹ [·] is the indicator function.


2.3 Binary Relevance


As described above, binary relevance learning decomposes the MLC task into several
binary classification tasks, one for each label. For every such task, a single-label
classifier, such as an SVM, random forest, or logistic regression, is trained. More
specifically, a classifier for the $j$-th label is trained on the dataset
$\{(x_i, y_{i,j})\}_{i=1}^{N}$. Formally, BR induces a multi-label predictor

$$\mathrm{BR}_b : \mathcal{X} \to \mathcal{Y}, \quad x \mapsto (b_1(x), b_2(x), \ldots, b_m(x)),$$

where $b_j : \mathcal{X} \to \{0,1\}$ represents the prediction of the base learner for the
$j$-th label.
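
The decomposition can be sketched as follows. The base learner is modelled as a function that takes training data and returns a predictor; the toy 1-nearest-neighbour learner `one_nn` and the function name `fit_binary_relevance` are illustrative assumptions and merely stand in for the WEKA classifiers used in the experiments.

```python
from typing import Callable, List, Sequence

Instance = Sequence[float]
Predictor = Callable[[Instance], int]
BaseLearner = Callable[[Sequence[Instance], Sequence[int]], Predictor]


def one_nn(X: Sequence[Instance], y: Sequence[int]) -> Predictor:
    """Toy base learner: 1-nearest neighbour under squared Euclidean distance."""
    X = [tuple(x) for x in X]
    y = list(y)

    def predict(q: Instance) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(x, q)) for x in X]
        return y[dists.index(min(dists))]

    return predict


def fit_binary_relevance(
    learner: BaseLearner, X: Sequence[Instance], Y: Sequence[Sequence[int]]
) -> Callable[[Instance], List[int]]:
    """Train one binary predictor per label column and combine their outputs."""
    m = len(Y[0])
    predictors = [learner(X, [row[j] for row in Y]) for j in range(m)]
    return lambda q: [b(q) for b in predictors]


# Toy data: label 1 follows the first feature, label 2 the second feature.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
br = fit_binary_relevance(one_nn, X, Y)
print(br((0.9, 0.2)))  # [1, 0]
```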


3 Related Work 


Binary relevance has been subject to modifications in various directions, an 
excellent overview of which is provided in a recent survey [28]. Extensions of 
BR mainly focus on its inability to exploit label correlations, due to treating 
all labels independently of each other. Three types of approaches have been 
proposed to overcome this problem. The first is to use classifier chains [15]. In 
this approach, one first defines a total order among the m labels and then trains 
binary classifiers in this order. The input of the classifier for the $i$-th label is the
original data plus the predictions of all classifiers for labels preceding this label in 
the chain. Similarly, in addition to the binary classifiers for the m labels, stacking 
uses a second layer of m meta-classifiers, one for each label, which take as input 
the original data augmented by the predictions of all base learners [11,21]. A 
third approach seeks to capture the dependencies in a Bayesian network, and 
to learn such a network from the data [1,20]. One can then use probabilistic 
inference to compute the probability for each possible prediction. 

Another line of research looks at how the problem of imbalanced classes can 
be addressed using BR. Class imbalance constitutes an important challenge in 
multi-label classification in general, since most labels are usually irrelevant for 
an instance, i.e., the overwhelming majority of labels in a binary task is negative. 
Using BR, the imbalance can be “repaired” in a label-wise manner, using tech- 
niques for standard binary classification, such as sampling [5] or thresholding 
the decision boundary [13]. An approach taking dependencies among labels into 
account (and hence applied prior to splitting the problem) is presented in [27]. 

To the best of our knowledge, ours is the first approach in which the base
learner used for the different labels is subject to optimization itself. In fact, 
except for AutoML tools, we are not even aware of an approach optimizing a 
single base learner applied to all labels. In all the above approaches, the choice 
of the base learners is an external decision and not part of the learning problem 
itself.


4 Label-Wise Selection of Base Learners


As stated above, while various attempts at improving binary relevance
learning by capturing label dependencies have been made, the choice of the 
base learner for tackling the underlying binary problems—as another potential 
source of improvement—has attracted much less attention in the literature so 
far. If considered at all, this choice has been restricted to the selection of a single 
learner, which is applied to all m binary problems simultaneously. 

We proceed from a portfolio of base learners 


$$\mathcal{A} := \{ a \mid a : (\mathcal{X}^N \times \{0,1\}^N) \to (\mathcal{X} \to \{0,1\}) \} .$$


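Then, given training data $D_{train} = (X_{train}, Y_{train})$, the objective is to find the
base learner $a$ for which BR presumably performs best on test data
$D_{test} = (X_{test}, Y_{test})$ with respect to some loss function $\mathcal{L}$:

$$\operatorname*{arg\,min}_{a \in \mathcal{A}} \; \mathcal{L}\big(Y_{test}, \mathrm{BR}_b(X_{test})\big), \quad \text{with } b_j = a\big(X_{train}, Y_{train}^{(j)}\big), \tag{3}$$

where $Y_{train}^{(j)}$ denotes the $j$-th column of the label matrix $Y_{train}$.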


Moreover, we propose to leverage the independence assumption underlying 
BR to select a different base learner for each of the labels, and refer to this 
variant as LiBRe. We are thus interested in solving the following problem: 

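$$\operatorname*{arg\,min}_{a \in \mathcal{A}^m} \; \mathcal{L}\big(Y_{test}, \mathrm{BR}_b(X_{test})\big), \quad \text{with } b_j := a_j\big(X_{train}, Y_{train}^{(j)}\big). \tag{4}$$
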
Compared to (3), we thus significantly increase flexibility. In fact, by taking 
advantage of the different behavior of the respective base learners, and the ability 
to model the relationship between features and a class label differently for each 
binary problem, one may expect to improve the overall performance of BR. On
the other hand, the BR learner as a whole is now equipped with many degrees of
freedom, namely the choice of the base learners, which can be seen as
“hyper-parameters” of LiBRe. Since this may easily lead to undesirable effects such
as over-fitting of the training data, an improvement in terms of generalization
performance (approximated by the performance on the test data) is by no means
self-evident. From this point of view, the restriction to a single base learner in (3)
can also be seen as a sort of regularization. This kind of regularization can indeed
be justified for various reasons. In most cases, for example, the binary problems 
are indeed not completely different but share important characteristics. 

Computationally, (4) may appear more expensive than choosing a single base 
learner jointly for all the labels, at least at first sight. However, the complexity in 
terms of the number of base learners to be evaluated remains exactly the same. 
In fact, just like in (3), we need to fit a BR model for every base learner exactly 
once. The only difference is that, instead of picking one of the base learners for 
all labels in the end, LiBRe assembles the base learners performing best for the 
respective labels (recall that we target label-wise decomposable performance
measures). 
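
The difference between (3) and (4) comes down to how one and the same table of per-label validation losses is read. The sketch below assumes such a table is already available, with one row per base learner and one column per label; the loss values and the function names `select_sbb` and `select_libre` are made up for illustration.

```python
from typing import Dict, List

# Hypothetical per-label validation losses (lower is better), one row per base learner.
val_loss: Dict[str, List[float]] = {
    "RF":  [0.10, 0.30, 0.20],
    "NB":  [0.25, 0.10, 0.30],
    "J48": [0.20, 0.20, 0.15],
}


def select_sbb(loss: Dict[str, List[float]]) -> str:
    """Problem (3): one base learner for all labels, chosen by mean loss."""
    return min(loss, key=lambda a: sum(loss[a]) / len(loss[a]))


def select_libre(loss: Dict[str, List[float]]) -> List[str]:
    """Problem (4): for each label, the base learner with the lowest loss on that label."""
    m = len(next(iter(loss.values())))
    return [min(loss, key=lambda a: loss[a][j]) for j in range(m)]


def mean_loss(selection: List[str], loss: Dict[str, List[float]]) -> float:
    """Mean per-label loss of a label-wise assignment of base learners."""
    return sum(loss[a][j] for j, a in enumerate(selection)) / len(selection)


sbb = select_sbb(val_loss)
libre = select_libre(val_loss)
print(sbb)    # J48
print(libre)  # ['RF', 'NB', 'J48']
```

Both selections use the same three rows, i.e., the same number of BR fits. On the data used for selection, the label-wise choice can never be worse than the single best learner; whether this carries over to unseen data is exactly the over-fitting question raised above.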
5 Experimental Evaluation


This section presents an empirical evaluation of LiBRe, comparing it to the 
use of a single base learner as a baseline. We first describe the experimental 
setup (Sect. 5.1), specify the baseline with the single best base learner (Sect. 5.2), 
and define the oracle performance (Sect. 5.3) for an upper bound. Finally, the
experimental results are presented in Sect. 5.4. 


5.1 Experimental Setup 


For the evaluation, we considered a total of 24 MLC datasets. These datasets 
stem from various domains, such as text, audio, image classification, and biology, 
and range from small datasets with only a few instances and labels to larger 
datasets with thousands of instances and hundreds of labels. A detailed overview 
is given in Table 1, where, in addition to the number of instances (#I) and
number of labels (#L), statistics regarding the label-to-instance ratio (L2IR), the 
percentage of unique label combinations (ULC), and the average label cardinality 
(card.) are given. 

The train and validation folds were derived by conducting a nested 2-fold 
cross validation, i.e., to assess the test performance we have an outer loop of 2- 
fold cross validation. To tune the thresholds and select the base learner, we again 
split the training fold of the outer loop into train and validation sets by 2-fold 
cross validation. The entire process is repeated 5 times with different random 
seeds for the cross validation. Throughout this study, we trained and evaluated 
a total of 14,400 instances of BR and 649,800 base learners accordingly. 
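
The reported totals can be reproduced with a short calculation. The accounting below is one reading that matches the numbers and should be taken as an assumption: per outer fold, every base learner is fitted once per inner fold (train/validation) and once more on the outer training fold (train/test). The label counts are those listed in Table 1.

```python
from fractions import Fraction

# Number of labels (#L) per dataset, taken from Table 1 in reading order.
LABELS_PER_DATASET = [26, 159, 19, 208, 30, 33, 33, 6, 53, 21, 12, 27,
                      32, 75, 101, 45, 22, 33, 6, 40, 39, 27, 22, 14]

N_BASE_LEARNERS = 20
N_SEEDS = 5
OUTER_FOLDS = 2
INNER_FOLDS = 2

# Assumption: inner train/validation fits plus one train/test fit per outer fold.
FITS_PER_OUTER_FOLD = INNER_FOLDS + 1

br_fits_per_dataset = N_BASE_LEARNERS * N_SEEDS * OUTER_FOLDS * FITS_PER_OUTER_FOLD
total_br_fits = len(LABELS_PER_DATASET) * br_fits_per_dataset
total_base_learner_fits = sum(LABELS_PER_DATASET) * br_fits_per_dataset

# Share of a dataset that an inner training split sees.
inner_train_fraction = Fraction(1, OUTER_FOLDS) * Fraction(1, INNER_FOLDS)

print(total_br_fits)            # 14400
print(total_base_learner_fits)  # 649800
print(inner_train_fraction)     # 1/4
```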

Furthermore, we consider two performance measures, namely the Hamming
loss $\mathcal{L}_H$ and the macro-averaged label-wise F-measure as defined in (1) and (2),
respectively. A binary prediction is obtained by thresholding the prediction of
an underlying scoring classifier, which produces values in the unit interval (the
higher the value, the more likely a label is considered relevant). The thresholds
$\tau = (\tau_1, \tau_2, \ldots, \tau_m)$ are optimized by a grid search considering values
$\tau_j \in (0, 1]$ with a step size of 0.01. When optimizing the thresholds, we either allow for
label-wise optimization or constrain the threshold to be the same for all labels
(uniform $\tau$), i.e., $\tau_i = \tau_j$ for all $i, j \in \{1, \ldots, m\}$.
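
A sketch of the two threshold tuning variants is given below, here for the number of misclassified entries (the unnormalized Hamming loss). It assumes that a label is predicted as relevant when its score is at least the threshold, and that ties between thresholds are broken in favour of the smallest value; neither detail is stated in the text, and the function names are illustrative.

```python
from typing import List, Sequence

GRID = [k / 100 for k in range(1, 101)]  # 0.01, 0.02, ..., 1.00


def label_errors(scores: Sequence[float], truth: Sequence[int], tau: float) -> int:
    """Number of instances misclassified for one label at threshold tau."""
    return sum((s >= tau) != bool(t) for s, t in zip(scores, truth))


def column(M: Sequence[Sequence[float]], j: int) -> List[float]:
    return [row[j] for row in M]


def tune_label_wise(S: Sequence[Sequence[float]], Y: Sequence[Sequence[int]]) -> List[float]:
    """One threshold per label, each minimizing the errors on its own label."""
    m = len(Y[0])
    return [
        min(GRID, key=lambda t: label_errors(column(S, j), column(Y, j), t))
        for j in range(m)
    ]


def tune_uniform(S: Sequence[Sequence[float]], Y: Sequence[Sequence[int]]) -> float:
    """A single threshold shared by all labels, minimizing the total errors."""
    m = len(Y[0])
    return min(
        GRID,
        key=lambda t: sum(label_errors(column(S, j), column(Y, j), t) for j in range(m)),
    )


# Toy validation scores (4 instances, 2 labels) and the corresponding ground truth.
S = [[0.9, 0.2], [0.8, 0.4], [0.3, 0.35], [0.1, 0.1]]
Y = [[1, 0], [1, 1], [0, 1], [0, 0]]
taus = tune_label_wise(S, Y)
tau_uniform = tune_uniform(S, Y)
print(taus, tau_uniform)
```

For the F-measure, the same grid search applies with `label_errors` replaced by the negated per-label F-measure.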

In order to determine significance of results, we apply a Wilcoxon signed rank 
test with a threshold for the p-value of 0.05. Significant improvements of LiBRe
are marked by • and significant degradations by ◦.

We executed the single BR evaluation runs, i.e., training and evaluating either
on the validation or test split, on up to 300 nodes in parallel, each of them
equipped with 8 CPU cores and 32 GB of RAM, with a timeout of 6 h. Because of
these memory and runtime limits, some of the evaluations failed with
memory overflows or timeouts.

The implementation is based on the Java machine learning library WEKA [8]
and an extension for multi-label classification called MEKA [16]. In our study, we
consider a total of 20 base learners from WEKA: BayesNet (BN), DecisionStump
(DS), IBk, J48, JRip (JR), KStar (KS), LMT, Logistic (L), MultilayerPerceptron
(MLP), NaiveBayes (NB), NaiveBayesMultinomial (NBM), OneR (1R), PART
(P), REPTree (REP), RandomForest (RF), RandomTree (RT), SMO, SimpleLogistic
(SL), VotedPerceptron (VP), ZeroR (0R). All the data and source code
are made available via GitHub (https://github.com/mwever/LiBRe).

Table 1. The datasets used in this study. Furthermore, the number of instances (#I),
the number of labels (#L), the label-to-instance ratio (L2IR), the percentage of unique
label combinations (ULC), and the label cardinality (card.) are given.

Dataset         #I      #L    L2IR     ULC    card.
arts1           7484    26    0.0035   0.08   1.65
bibtex          7395    159   0.0215   0.39   2.40
birds           645     19    0.0295   0.21   1.01
bookmarks       87856   208   0.0024   0.21   2.03
business1       11214   30    0.0027   0.02   1.60
computers1      12444   33    0.0027   0.03   1.51
education1      12030   33    0.0027   0.04   1.46
emotions        593     6     0.0101   0.05   1.87
enron-f         1702    53    0.0311   0.44   3.38
entertainment1  12730   21    0.0016   0.03   1.41
flags           194     12    0.0619   0.53   4.12
genbase         662     27    0.0408   0.05   1.25
health1         9205    32    0.0035   0.04   1.64
llog-f          1460    75    0.0514   0.21   1.18
mediamill       43907   101   0.0023   0.15   4.38
medical         978     45    0.0460   0.10   1.25
recreation1     12828   22    0.0017   0.04   1.43
reference1      8027    33    0.0041   0.03   1.17
scene           2407    6     0.0025   0.01   1.07
science1        6428    40    0.0062   0.07   1.45
social1         12111   39    0.0032   0.03   1.28
society1        14512   27    0.0019   0.07   1.67
tmc2007         28596   22    0.0008   0.05   2.16
yeast           2417    14    0.0058   0.08   4.24


5.2 Single Best Base Learner 


To figure out how much we can benefit from selecting a base learner for each label 
individually, and whether this flexibility is beneficial at all, we define the single 
best base learner, subsequently referred to as SBB, as a baseline. In principle, 
SBB is nothing but a grid search over the portfolio of base learners (3). 

Each candidate base learner $a$ is employed as the base learner for every
label. After training and validation, we pick
the base learner that performs best overall. This baseline thus gives an upper
bound on the performance of what can be achieved when the base learner is 
not chosen for each label individually. As simple and straight-forward as it is, 
this baseline represents what is currently possible in implementations of MLC 
libraries, and already goes beyond what is most commonly done in the literature. 


5.3 Optimistic Versus Validated Optimization 


In addition to the results obtained by selecting the base learner(s) according
to the validation performance (obtained in the inner loop of the nested cross
validation), we consider optimistic performance estimates, which are obtained
as follows: After having trained the base learners on the training data, we select
the presumably best one, not on the basis of their performance on validation
data, but based on their actual test performance (as observed in the outer loop
of the nested cross-validation). Intuitively, this can be understood as a kind of
“oracle” performance: Given a set of candidate predictors to choose from, the
oracle anticipates which of them will perform best on the test data.

Fig. 1. The heat map shows the average share of each base learner being employed
for a label with respect to the optimized performance measure: Hamming loss
($\mathcal{L}_H$) or the label-wise macro-averaged F-measure (F).

Although these performances should be treated with caution, and will cer- 
tainly tend to overestimate the true generalization performance of a classifier, 
they can give some information about the potential of the optimization. More 
specifically, these optimistic performance estimates suggest an upper bound on 
what can be obtained by the nested optimization routine. 
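
The relation between the two selection modes can be illustrated with made-up numbers. In the sketch below, `val` and `test` are hypothetical per-label losses of two base learners; the validated variant selects on `val` and is then scored on `test`, while the optimistic variant selects directly on `test`.

```python
from typing import Dict, List

# Hypothetical per-label losses (lower is better) for two base learners and two labels.
val: Dict[str, List[float]] = {"RF": [0.10, 0.30], "NB": [0.20, 0.10]}
test: Dict[str, List[float]] = {"RF": [0.15, 0.20], "NB": [0.12, 0.25]}


def pick_per_label(loss: Dict[str, List[float]]) -> List[str]:
    """For each label, the base learner with the lowest loss in the given table."""
    m = len(next(iter(loss.values())))
    return [min(loss, key=lambda a: loss[a][j]) for j in range(m)]


def mean_test_loss(selection: List[str], test_loss: Dict[str, List[float]]) -> float:
    """Mean per-label test loss of a label-wise assignment of base learners."""
    return sum(test_loss[a][j] for j, a in enumerate(selection)) / len(selection)


validated = pick_per_label(val)    # selection based on validation data
optimistic = pick_per_label(test)  # "oracle" selection based on test data

print(validated, mean_test_loss(validated, test))    # ['RF', 'NB'], about 0.20
print(optimistic, mean_test_loss(optimistic, test))  # ['NB', 'RF'], about 0.16
```

By construction, the optimistic selection attains the lowest test loss of any label-wise assignment, so it bounds what a validated selection can reach on the same fitted models.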


5.4 Results 


In Fig. 1, the average share of a base learner per label is shown. The
heat map shows that, for the SBB baseline, only a subset of the base
learners plays a role. However, the distribution of the
shares varies when different performance measures are optimized. Furthermore,
although random forest (RF) achieves large shares of 0.8 for the Hamming
loss and around 0.6 for the F-measure, it is not best on all the datasets. To put
it differently, one still needs to optimize the base learner per dataset. This is
especially true when different performance measures are of interest.

In the case of LiBRe, the shares are spread much more widely
over the base learners than for SBB. For example, the shares of RF decrease
to 0.29 for the F-measure and to 0.25 for the Hamming loss. Moreover, base
learners that did not play any role in SBB now gain in importance
and are selected quite often. Although there are significant differences in the
frequency of base learners being picked, there is not a single base learner in the
portfolio that was never selected.

In Table 2, the results for optimizing Hamming loss are presented. The
optimistic performance estimates already indicate that there is not much room for
improvement. This comes as no surprise, since the datasets are already largely
saturated, i.e., the loss is already close to 0 for most of the datasets. While
LiBRe performs competitively with SBB for the setting with uniform $\tau$, SBB
compares favourably to LiBRe in the case where the thresholds can be tuned in a
label-wise manner. Apparently, the additional degrees of freedom make LiBRe
more prone to over-fitting, especially on smaller datasets.

Table 2. Results obtained for minimizing $\mathcal{L}_H$ optimistically resp. with validation
performances. Thresholds are optimized either jointly for all the labels (uniform $\tau$) or
label-wise. Best performances per setting and dataset are highlighted in bold.
Significant improvements of LiBRe are marked by • and degradations by ◦.

Dataset         Optimistic       Validated          Optimistic       Validated
                uniform τ        uniform τ          label-wise τ     label-wise τ
                LiBRe   SBB      LiBRe   SBB        LiBRe   SBB      LiBRe   SBB
arts1           0.0515  0.0536   0.0531  0.0538     0.0504  0.0513   0.0526  0.0525
bibtex          0.0118  0.0126   0.0126  0.0127     0.0115  0.0120   0.0151  0.0139
birds           0.0357  0.0397   0.0476  0.0420 ◦   0.0329  0.0352   0.0470  0.0422 ◦
bookmarks       0.0085  0.0087   0.0086  0.0087 •   0.0085  0.0086   0.0105  0.0114 •
business1       0.0233  0.0248   0.0241  0.0249 •   0.0218  0.0223   0.0227  0.0228
computers1      0.0313  0.0334   0.0329  0.0335     0.0301  0.0306   0.0323  0.0312
education1      0.0352  0.0365   0.0359  0.0369 •   0.0340  0.0344   0.0354  0.0349 ◦
emotions        0.1762  0.1800   0.1926  0.1856 ◦   0.1684  0.1712   0.1961  0.1875 ◦
enron-f         0.0447  0.0474   0.0481  0.0477     0.0437  0.0445   0.0485  0.0469 ◦
entertainment1  0.0432  0.0466   0.0440  0.0469 •   0.0414  0.0434   0.0430  0.0443 •
flags           0.1732  0.1979   0.2134  0.2088     0.1635  0.1799   0.2105  0.2158
genbase         0.0007  0.0014   0.0069  0.0016 ◦   0.0006  0.0007   0.0070  0.0023 ◦
health1         0.0305  0.0344   0.0313  0.0347 •   0.0282  0.0297   0.0303  0.0302
llog-f          0.0149  0.0153   0.0202  0.0157 ◦   0.0145  0.0149   0.0230  0.0178 ◦
mediamill       0.0268  0.0270   0.0271  0.0270     0.0261  0.0262   0.0265  0.0265
medical         0.0084  0.0103   0.0115  0.0109     0.0078  0.0093   0.0136  0.0116
recreation1     0.0459  0.0472   0.0472  0.0473     0.0446  0.0453   0.0468  0.0462
reference1      0.0244  0.0264   0.0267  0.0268     0.0230  0.0245   0.0255  0.0251
scene           0.0781  0.0788   0.0817  0.0794 ◦   0.0757  0.0762   0.0816  0.0800 ◦
science1        0.0281  0.0311   0.0311  0.0317     0.0269  0.0291   0.0304  0.0302
social1         0.0197  0.0208   0.0227  0.0210     0.0188  0.0196   0.0223  0.0200
society1        0.0474  0.0495   0.0479  0.0496 •   0.0444  0.0455   0.0455  0.0461 •
tmc2007         0.0601  0.0611   0.0600  0.0611 •   0.0590  0.0611   0.0613  0.0611
yeast           0.1914  0.1926   0.2002  0.1930 ◦   0.1886  0.1890   0.1940  0.1929 ◦

In contrast to the previous results, for the optimization of the F-measure,
the optimistic performance estimates already give a promising outlook on the
potential for improving the generalization performance through the label-wise
selection of the base learners. More precisely, they indicate that performance
gains of up to 11 percentage points are possible. Independent of the threshold
optimization variant, LiBRe outperforms the SBB baseline, yielding the best
performance on two thirds of the considered datasets; 13 of these improvements
are significant in the case of uniform $\tau$, and 11 in the case of label-wise $\tau$.
Significant degradations of LiBRe compared to SBB are observed for only 2 and 3
datasets, respectively.
Hence, for the F-measure, LiBRe compares favorably to the SBB baseline.

In summary, we conclude that LiBRe does indeed yield performance improve- 
ments. However, increasing the flexibility of BR also makes it more prone to 
over-fitting. Furthermore, these results were obtained by conducting a nested 
2-fold cross validation. While keeping the computational costs of this evaluation 
reasonable, this implies that, for the purpose of validation, the base learners were 
trained on only one fourth of the original dataset. Therefore, considering nested 
5-fold or 10-fold cross validation could help to reduce the observed over-fitting.

Table 3. Results for maximizing the F-measure optimistically resp. with validation
performances. Thresholds are optimized either jointly for all the labels (uniform $\tau$) or
label-wise. Best performances per setting and dataset are highlighted in bold.
Significant improvements of LiBRe are marked by • and degradations by ◦.

Dataset         Optimistic       Validated          Optimistic       Validated
                uniform τ        uniform τ          label-wise τ     label-wise τ
                LiBRe   SBB      LiBRe   SBB        LiBRe   SBB      LiBRe   SBB
arts1           0.3445  0.2749   0.3018  0.2684 •   0.3680  0.3211   0.3184  0.3001 •
bibtex          0.4020  0.3027   0.3391  0.2998 •   0.4194  0.3516   0.3378  0.3041 •
birds           0.5404  0.4424   0.3707  0.3961 ◦   0.5832  0.5310   0.3843  0.3981 ◦
bookmarks       0.2495  0.2244   0.2347  0.2239 •   0.2646  0.2516   0.2435  0.2416
business1       0.3692  0.2854   0.2970  0.2659 •   0.3874  0.3197   0.3006  0.2790 •
computers1      0.3646  0.2861   0.3099  0.2810 •   0.3833  0.3486   0.3224  0.3190
education1      0.3346  0.2468   0.2594  0.2437 •   0.3591  0.3022   0.2652  0.2612
emotions        0.7068  0.6946   0.6670  0.6779     0.7186  0.7135   0.6761  0.6859 ◦
enron-f         0.2870  0.2192   0.2056  0.2096     0.3138  0.2773   0.2077  0.2069
entertainment1  0.4470  0.3673   0.3929  0.3500 •   0.4639  0.4049   0.3950  0.3774 •
flags           0.6280  0.5634   0.5230  0.5098     0.6474  0.5981   0.5150  0.5145
genbase         0.8126  0.7798   0.6039  0.7421 ◦   0.8141  0.8119   0.6201  0.6390
health1         0.4203  0.3259   0.3486  0.3208 •   0.4312  0.3582   0.3464  0.3225 •
llog-f          0.1569  0.0808   0.0730  0.0689     0.1834  0.1264   0.0744  0.0741
mediamill       0.3766  0.3499   0.3481  0.3483     0.4010  0.3898   0.3543  0.3600 ◦
medical         0.4960  0.3852   0.3560  0.3639     0.5251  0.4523   0.3547  0.3208 •
recreation1     0.4964  0.4224   0.4669  0.4160 •   0.5093  0.4675   0.4670  0.4494 •
reference1      0.3185  0.2254   0.2477  0.2021 •   0.3393  0.2860   0.2587  0.2418 •
scene           0.7831  0.7816   0.7734  0.7776     0.7909  0.7897   0.7759  0.7812
science1        0.3824  0.2724   0.2928  0.2637 •   0.4033  0.3240   0.3036  0.2662 •
social1         0.3629  0.3073   0.3046  0.3060     0.3737  0.3119   0.3103  0.2769 •
society1        0.3437  0.2807   0.3180  0.2688 •   0.3597  0.3382   0.3215  0.3238
tmc2007         0.5659  0.5342   0.5467  0.5342     0.5782  0.5525   0.5656  0.5484 •
yeast           0.4970  0.4750   0.4800  0.4731 •   0.5145  0.5084   0.4922  0.4947


6 Conclusion 


In this paper, we have not only demonstrated the potential of binary relevance 
to optimize label-wise macro averaged measures, but also the importance of 
the base learner as a hyper-parameter for each label. Especially for the case of 
optimizing for F1 macro-averaged over the labels, we could achieve significant 
performance improvements by choosing a proper base learner in a label-wise 
manner. Compared to selecting the best single base learner, choosing the base 
learner for each label individually comes at no additional cost in terms of base 
learner evaluations. Moreover, the label-wise selection of base learners can be 
realized by a straight-forward grid search. 

As the label-wise choice of a base learner has already led to considerable 
performance gains, we plan to examine to what extent the optimization of the 
hyper-parameters of those base learners can lead to further improvements. Fur- 
thermore, we want to increase the efficiency of the tuning by replacing the grid 
search with a heuristic approach.

Another direction of future work concerns the avoidance of over-fitting effects
due to the excessive flexibility of LiBRe. As already explained, the restriction
to a single base learner can be seen as a kind of regularization, which, however,
appears to be too strong, at least according to our results. On the other
hand, the full flexibility of LiBRe does not always pay off either. An interesting
compromise could be to restrict the number of different base learners used by
LiBRe to a suitable value $k \in \{1, \ldots, m\}$. Technically, this comes down to finding
the arg min in (4), not over $a \in \mathcal{A}^m$, but over
$\{a \in \mathcal{A}^m \mid \#\{a_1, \ldots, a_m\} \le k\}$.
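
For a small portfolio, the restricted problem can be solved by brute force, as the following sketch shows. It is only an illustration of the constraint and not a method proposed in the paper: it assumes a label-wise decomposable loss, enumerates all subsets of at most k base learners, and assigns each label the best learner within the subset. The loss values and the function name `select_at_most_k` are made up.

```python
from itertools import combinations
from typing import Dict, List

# Hypothetical per-label validation losses (lower is better).
val_loss: Dict[str, List[float]] = {
    "RF":  [0.10, 0.30, 0.20],
    "NB":  [0.25, 0.10, 0.30],
    "J48": [0.20, 0.20, 0.15],
}


def select_at_most_k(loss: Dict[str, List[float]], k: int) -> List[str]:
    """Label-wise assignment with the lowest total loss using at most k distinct learners."""
    if k < 1:
        raise ValueError("k must be at least 1")
    learners = list(loss)
    m = len(loss[learners[0]])
    best_total, best_assignment = None, None
    # Enlarging a subset can never increase the loss, so subsets of maximal size suffice.
    for subset in combinations(learners, min(k, len(learners))):
        assignment = [min(subset, key=lambda a: loss[a][j]) for j in range(m)]
        total = sum(loss[a][j] for j, a in enumerate(assignment))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, assignment
    return best_assignment


print(select_at_most_k(val_loss, 1))  # ['J48', 'J48', 'J48']
print(select_at_most_k(val_loss, 2))  # ['RF', 'NB', 'RF']
print(select_at_most_k(val_loss, 3))  # ['RF', 'NB', 'J48']
```

With k = 1 this coincides with the single best base learner, and with k at least the portfolio size it coincides with the unrestricted label-wise selection. The number of subsets grows combinatorially with k, so a heuristic search would be needed for larger portfolios.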

Abstract. Many-objective optimization, which deals with an optimization
problem with more than three objectives, poses a big challenge to
various search techniques, including evolutionary algorithms. Recently, 
a meta-objective optimization approach (called bi-goal evolution, BiGE) 
which maps solutions from the original high-dimensional objective space 
into a bi-goal space of proximity and crowding degree has received 
increasing attention in the area. However, it has been found that BiGE 
tends to struggle on a class of many-objective problems where the search 
process involves dominance resistant solutions, namely, those solutions 
with an extremely poor value in at least one of the objectives but with 
(near) optimal values in some of the others. It is difficult for BiGE to get 
rid of dominance resistant solutions as they are Pareto nondominated and 
far away from the main population, thus always having a good crowd- 
ing degree. In this paper, we propose an angle-based crowding degree 
estimation method for BiGE (denoted as aBiGE) to replace distance- 
based crowding degree estimation in BiGE. Experimental studies show 
the effectiveness of this replacement. 


Keywords: Many-objective optimization · Evolutionary algorithm ·
Bi-goal evolution · Angle-based crowding degree estimation


1 Introduction 


Many-objective optimization problems (MaOPs) refer to the optimization of 
four or more conflicting criteria or objectives at the same time. MaOPs exist 
in many fields, such as environmental engineering, software engineering, control 
engineering, industry, and finance. For example, when assessing the performance 
of a machine learning algorithm, one may need to take into account not only 
accuracy but also some other criteria such as efficiency, misclassification cost,
interpretability, and security.


There is often no single best solution for an MaOP, since a performance
increase in one objective will lead to a decrease in some other objectives.
In the past three decades, multi-objective evolutionary algorithms (MOEAs)
have been successfully applied in many real-world optimization problems with
a low-dimensional objective space (two or three conflicting objectives) to search for
a set of trade-off solutions.

The major purpose of MOEAs is to provide a population (a set of optimal
individuals or solutions) that balances proximity (converging a population
to the Pareto front) and diversity (diversifying a population over the whole
Pareto front). By considering the two goals above, traditional MOEAs, such
as SPEA2 [13] and NSGA-II [1], mainly focus on the use of Pareto dominance
relations between solutions and the design of diversity control mechanisms.

However, compared with a low-dimensional optimization problem, well- 
known Pareto-based evolutionary algorithms lose their efficiency in solving 
MaOPs. In MaOPs, most solutions in a population become equally good solu- 
tions, since the Pareto dominance selection criterion fails to distinguish between 
solutions and drive the population towards the Pareto front. Then the density 
criterion is activated to guide the search, resulting in a substantial reduction of 
the convergence of the population and the slowdown of the evolution process. 
This is termed the active diversity promotion (ADP) phenomenon in [11]. 

Some studies [6] observed that the main reason for the ADP phenomenon is the
preference for dominance resistant solutions (DRSs). DRSs are solutions
that are extremely inferior to others in at least one objective but have
near-optimal values in some others. They are considered Pareto-optimal solutions
despite having very poor performance in terms of proximity. As a result,
Pareto-based evolutionary algorithms may end up with a population that is widely
spread but far away from the true Pareto front.

To address the difficulties of MOEAs in high-dimensional search space, one
approach is to modify the Pareto dominance relation. Some powerful algorithms
in this category include ε-MOEA [2] and fuzzy Pareto dominance [5]. These
methods work well under certain circumstances, but they often involve extra
parameters, and their performance often depends on the setting of those
parameters. The other approach, which does not rely on the Pareto dominance
relation, may be classified into two categories: aggregation-based algorithms [15]
and indicator-based algorithms [14]. These algorithms have been successfully
applied in some applications. However, the diversity performance of
aggregation-based algorithms depends on the distribution of weight vectors, while
indicator-based algorithms define specific performance indicators to guide the
search.

Recently, a meta-objective optimization algorithm for MaOPs, called Bi-Goal
Evolution (BiGE) [8], was proposed; the paper introducing it has become the most
cited paper published in the Artificial Intelligence journal over the past four
years. BiGE was inspired
by two observations in many-objective optimization: (1) the conflict between 
proximity and diversity requirement is aggravated when increasing the number 
of objectives and (2) the Pareto dominance relation is not effective in solving 
MaOPs. In BiGE, two indicators were used to estimate the proximity and crowd- 
ing degree of solutions in the population, respectively. By doing so, BiGE maps 
solutions from the original objective space to a bi-goal objective space and deals
with the two goals by nondominated sorting. This provides sufficient
selection pressure towards the Pareto front, regardless of the number of
objectives that the optimization problem has.

However, despite its attractive features, it has been found that BiGE tends to 
struggle on a class of many-objective problems where the search process involves 
DRSs. DRSs are far away from the main population and always ranked as good 
solutions by BiGE, thus hindering the evolutionary progress of the population. 
To address this issue, this paper proposes an angle-based crowding degree
estimation method for BiGE (denoted as aBiGE). The rest of the paper is organized
as follows. Section 2 gives some concepts and terminology about many-objective
optimization. In Sect. 3, we present our angle-based crowding degree estimation
method and its incorporation into BiGE. The experimental results are detailed
in Sect. 4. Finally, the conclusions and future work are set out in Sect. 5.


2 Concepts and Terminology 


Real-world optimization problems sometimes involve more than three performance
criteria that determine how “good” a certain solution is. These criteria, termed
objectives (e.g., cost, safety, efficiency), need to be optimized simultaneously,
but usually conflict with each other. This type of problem is called a
many-objective optimization problem (MaOP). A minimization MaOP can be
mathematically defined as follows:


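\[
\begin{aligned}
\text{minimize}\quad & F(x) = (f_1(x), f_2(x), \ldots, f_N(x)) \\
\text{subject to}\quad & g_j(x) \le 0, \quad j = 1, 2, \ldots, J \\
& h_k(x) = 0, \quad k = 1, 2, \ldots, K \\
& x = (x_1, x_2, \ldots, x_M), \quad x \in \Omega
\end{aligned}
\tag{1}
\]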


where x denotes an M-dimensional decision variable vector from the feasible
region in the decision space $\Omega$, $F(x)$ represents an N-dimensional objective
vector (N is larger than three), $f_i(x)$ is the i-th objective to be minimized, the
objective functions $f_1, f_2, \ldots, f_N$ constitute the N-dimensional space called
the objective space, and $g_j(x) \le 0$ and $h_k(x) = 0$ define J inequality and K
equality constraints, respectively.


Definition 1 (Pareto Dominance). Given two decision vectors $x, y \in \Omega$ of
a minimization problem, x is said to (Pareto) dominate y (denoted as $x \prec y$),
or equivalently y is dominated by x, if and only if [4]

\[
\forall i \in \{1, 2, \ldots, N\}: f_i(x) \le f_i(y) \;\wedge\; \exists i \in \{1, 2, \ldots, N\}: f_i(x) < f_i(y). \tag{2}
\]


Namely, given two solutions, one solution is said to dominate the other solution
if it is at least as good as the other solution in every objective and is strictly
better in at least one objective.


Definition 2 (Pareto Optimality). A solution $x \in \Omega$ is said to be Pareto
optimal if and only if there is no solution $y \in \Omega$ that dominates it. Solutions
that are not dominated by any other solution are said to be Pareto-optimal (or
non-dominated).


Definition 3 (Pareto Set). All Pareto-optimal (or non-dominated) solutions 
in the decision space constitute the Pareto set (PS). 


Definition 4 (Pareto Front). The Pareto front (PF) is the set of objective
vectors corresponding to the Pareto set.


Definition 5 (Dominance Resistant Solution). Given a solution set, a dominance
resistant solution (DRS) is a solution with an extremely poor value in at least
one objective, but with a near-optimal value in some other objective.
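
Definitions 1, 2 and 5 can be illustrated with a short Python sketch. It is not taken from the paper: the function names and the five sample points are hypothetical, and the last point is chosen only to mimic a DRS (marginally best in the first objective, extremely poor in the second).

```python
from typing import List, Sequence


def dominates(fx: Sequence[float], fy: Sequence[float]) -> bool:
    """Return True if objective vector fx Pareto-dominates fy (minimization)."""
    no_worse = all(a <= b for a, b in zip(fx, fy))
    strictly_better = any(a < b for a, b in zip(fx, fy))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective set; the last point mimics a DRS.
points = [(0.10, 0.90), (0.50, 0.50), (0.90, 0.10), (0.60, 0.60), (0.09, 50.0)]
front = non_dominated(points)
print(front)
```

In this example the point (0.60, 0.60) is removed because (0.50, 0.50) dominates it, whereas the DRS-like point (0.09, 50.0) stays in the non-dominated set: no other point is at least as good in the first objective.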


3 The Proposed Algorithm: aBiGE 


3.1 A Brief Review of BiGE 


Algorithm 1 shows the basic framework of BiGE. First, a parent population with
M solutions is randomly initialized. Second, the proximity and crowding degree of
each solution are estimated. Third, in the mating selection, individuals
that have better quality with regard to the proximity and crowding degree
tend to become parents of the next generation. Afterward, variation operators
(e.g., crossover and mutation) are applied to these parents to produce an
offspring population. Finally, the environmental selection is applied to reduce the
expanded population of parents and offspring to M individuals as the new parent
population of the next generation.


Algorithm 1. Basic Framework of BiGE
Require: P (current population), M (population size)
 1: P = Initialization(P)
 2: while termination criterion not fulfilled do
 3:     Proximity_Estimation(P)
 4:     Crowding_Degree_Estimation(P)
 5:     P' = Mating_Selection(P)
 6:     P'' = Variation(P')
 7:     Q = P' ∪ P''
 8:     P = Environmental_Selection(Q)
 9: end while
10: return P


In particular, a simple aggregation function is adopted to estimate the proximity
of an individual. For an individual x in a population, its aggregation value,
denoted as $f_p(x)$, is calculated as the sum of its normalized objective values in
the range [0, 1] (line 3 in Algorithm 1), formulated as [8]:

\[
f_p(x) = \sum_{j=1}^{N} \tilde{f}_j(x), \tag{3}
\]

where $\tilde{f}_j(x)$ denotes the normalized objective value of individual x in the
j-th objective, and N is the number of objectives. A smaller $f_p$ value of an
individual usually indicates good performance on proximity. In particular, a DRS
is more likely to obtain a significantly larger $f_p$ value than other
individuals in a population.
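
A minimal sketch of the proximity estimation in Eq. (3) is given below. The text only states that objective values are normalized to [0, 1]; the min-max normalization over the current population used here, the function name `proximity`, and the sample data are assumptions made for illustration.

```python
import numpy as np


def proximity(objs: np.ndarray) -> np.ndarray:
    """Sum of normalized objective values per individual (Eq. (3)).

    objs has shape (population size, number of objectives).
    Assumption: min-max normalization over the current population.
    """
    lo = objs.min(axis=0)
    hi = objs.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant objectives
    normalized = (objs - lo) / span
    return normalized.sum(axis=1)


# Hypothetical population of three individuals with two objectives.
objs = np.array([[0.0, 1.0], [0.25, 0.25], [1.0, 0.0]])
fp = proximity(objs)
print(fp)
```

With N objectives the returned values lie in [0, N], and a smaller value indicates better proximity; here the middle individual obtains the smallest value.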

In addition, the crowding degree of an individual x (line 4 in Algorithm 1)
is defined as follows [8]:

\[
f_c(x) = \Bigl( \sum_{y \in Q,\; x \neq y} sh(x, y) \Bigr)^{1/2}, \tag{4}
\]

where $sh(x, y)$ denotes a sharing function. It is a penalized Euclidean distance
between two individuals x and y by using a weight parameter, defined as follows:

\[
sh(x, y) =
\begin{cases}
\bigl(0.5\,(1 - \frac{d(x, y)}{r})\bigr)^2, & \text{if } d(x, y) < r,\ f_p(x) < f_p(y) \\
\bigl(1.5\,(1 - \frac{d(x, y)}{r})\bigr)^2, & \text{if } d(x, y) < r,\ f_p(x) > f_p(y) \\
rand(), & \text{if } d(x, y) < r,\ f_p(x) = f_p(y) \\
0, & \text{otherwise}
\end{cases}
\tag{5}
\]

where r is the radius of a niche, adaptively calculated by $r = 1/\sqrt[N]{M}$ (M is
the population size and N is the number of objectives). The function rand() means
to assign either $sh(x, y) = (0.5(1 - [d(x, y)/r]))^2$ and $sh(y, x) = (1.5(1 - [d(x, y)/r]))^2$
or $sh(x, y) = (1.5(1 - [d(x, y)/r]))^2$ and $sh(y, x) = (0.5(1 - [d(x, y)/r]))^2$ randomly.
Individuals with a lower crowding degree imply better performance on diversity.
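
The distance-based crowding degree of Eqs. (4) and (5) can be sketched as follows. This is an illustrative reading rather than the reference implementation of [8]: it assumes that d(x, y) is the Euclidean distance between normalized objective vectors, and the function name, the `rng` argument and the sample data are hypothetical.

```python
import numpy as np


def crowding_degree(norm_objs, fp, rng=None):
    """Distance-based crowding degree f_c for every individual (Eqs. (4)-(5)).

    norm_objs: array of shape (M, N) with normalized objective vectors.
    fp:        array of shape (M,) with proximity values f_p.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = norm_objs.shape
    r = 1.0 / m ** (1.0 / n)              # niche radius r = 1 / M^(1/N)
    total = np.zeros(m)
    for i in range(m):
        for j in range(i + 1, m):
            d = np.linalg.norm(norm_objs[i] - norm_objs[j])
            if d >= r:
                continue                  # outside the niche: sh = 0
            base = 1.0 - d / r
            if fp[i] < fp[j]:
                wi, wj = 0.5, 1.5         # better proximity, lighter penalty
            elif fp[i] > fp[j]:
                wi, wj = 1.5, 0.5
            else:                         # tie: assign the weights at random
                wi, wj = (0.5, 1.5) if rng.random() < 0.5 else (1.5, 0.5)
            total[i] += (wi * base) ** 2
            total[j] += (wj * base) ** 2
    return np.sqrt(total)


# Hypothetical data: two close individuals and one isolated individual.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
fp_values = np.array([0.0, 0.1, 2.0])
fc = crowding_degree(pts, fp_values, np.random.default_rng(0))
print(fc)
```

The isolated individual has no neighbor inside the niche and therefore receives the best (zero) crowding degree regardless of its proximity value, which is the behavior that allows distant solutions to be rated well on diversity.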

It is observed that BiGE tends to struggle on a class of MaOPs where the search
process involves DRSs, such as DTLZ1 and DTLZ3 (in the well-known benchmark
test suite DTLZ [3]). Figure 1 shows the true Pareto front of the eight-objective
DTLZ1 and the final solution set of BiGE in one typical run on the eight-objective
DTLZ1 by parallel coordinates. The parallel coordinates map the original
many-objective solution set to a 2D parallel coordinates plane. In particular, Li et al.
[9] systematically explained how to read many-objective solution sets in parallel
coordinates, and indicated that parallel coordinates can partly reflect the quality
of a solution set in terms of convergence, coverage, and uniformity.
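
For readers unfamiliar with the benchmark, the following sketch evaluates DTLZ1 using the standard formulation from the DTLZ suite [3]. The function name, the choice of five distance-related variables and the sample decision vectors are assumptions for illustration only.

```python
import numpy as np


def dtlz1(x: np.ndarray, n_obj: int) -> np.ndarray:
    """Evaluate DTLZ1 for one decision vector x with entries in [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_dist = x[n_obj - 1:]                # distance-related variables
    k = x_dist.size
    g = 100.0 * (k + np.sum((x_dist - 0.5) ** 2
                            - np.cos(20.0 * np.pi * (x_dist - 0.5))))
    f = np.full(n_obj, 0.5 * (1.0 + g))
    for i in range(n_obj):
        # Objective i multiplies the first n_obj - 1 - i position variables
        # and, for i > 0, the factor (1 - x[n_obj - 1 - i]).
        f[i] *= np.prod(x[: n_obj - 1 - i])
        if i > 0:
            f[i] *= 1.0 - x[n_obj - 1 - i]
    return f


# Eight objectives, seven position variables, five distance variables.
x_opt = np.concatenate([np.full(7, 0.3), np.full(5, 0.5)])  # g = 0
x_far = np.concatenate([np.full(7, 0.3), np.zeros(5)])      # g = 125
print(dtlz1(x_opt, 8).sum(), dtlz1(x_far, 8).sum())
```

On the Pareto front the objective values sum to 0.5, which matches the range of 0 to 0.5 per objective quoted for the true front. Every objective is scaled by (1 + g), so decision vectors whose distance variables deviate from 0.5 can reach objective values in the hundreds.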

Clearly, there are some solutions that are far away from the Pareto front in
BiGE, with the solution set of eight-objective DTLZ1 ranging from 0 to around
450 compared to the Pareto front ranging from 0 to 0.5 on each objective. Such
solutions always have a poor proximity degree and a good crowding degree
(estimated by Euclidean distance) in the bi-goal objective space (i.e., convergence
and diversity), and will be preferred since there is no solution in the population
that dominates them in BiGE. These solutions hinder BiGE from converging
the population to the Pareto front, given their poor performance in terms
of convergence. A straightforward method to remove DRSs is to change the
crowding degree estimation method.


Fig. 1. The true Pareto front and the final solution set of BiGE on the eight-objective
DTLZ1, shown by parallel coordinates.


3.2 Basic Idea 


The basic idea of the proposed method is based on some observations of DRSs. 
Figure 2 shows one typical situation of a non-dominated set with five individuals 
including two DRSs (i.e, A and E) in a two-dimensional objective minimization 
scenario. 

As seen, it is difficult to find a solution that could dominate DRSs when
the crowding degree is estimated using Euclidean distance. Take individual A as an
example: it performs well on objective $f_1$ (slightly better than B, with a
near-optimal value of 0) but is inferior to all the other solutions on objective
$f_2$. It is difficult to find a solution with a better value than A on objective
$f_1$, and the same holds for individual E on objective $f_2$. A and E (with poor
proximity degree and good crowding degree) are considered good solutions and have
a high probability of surviving into the next generation in BiGE. However, the
results would be different if the distance-based crowding degree estimation were
replaced by one based on the vector angle. It can be observed that (1) an
individual in a crowded area would have a smaller vector angle to its neighbor
compared to an individual in a sparse area, e.g., C and D, and (2) a DRS would
have an extremely small vector angle to its neighbor, e.g., the angle between A
and B or the angle between E and D. Namely, these DRSs would be assigned both
poor proximity and poor crowding degrees, and have a high probability of being
deleted during the evolutionary process. Therefore, vector angles have the
advantage of distinguishing DRSs in the population and could be incorporated into
crowding degree estimation.


Fig. 2. An illustration of a population of five solutions with two DRSs, A and E. They
have good crowding degrees estimated by the Euclidean distance, but poor crowding
degrees calculated by the vector angle between two neighbors.


3.3 Angle-Based Crowding Degree Estimation 


Inspired by the work in [12], we propose a novel angle-based crowding degree esti- 
mation method, and integrate it into the BiGE framework (line 4 in Algorithm 1), 
called aBiGE. Before estimating the diversity of an individual in a population in 
aBiGE, we first introduce some basic definitions. 


Norm. For an individual $x_i$, its norm in the normalized objective space,
denoted as $norm(x_i)$, is defined as [12]:

\[
norm(x_i) = \sqrt{\sum_{j=1}^{N} f'_j(x_i)^2}. \tag{6}
\]


Vector Angles. The vector angle between two individuals $x_i$ and $x_k$ is defined
as follows [12]:

\[
angle_{x_i \to x_k} = \arccos \frac{F'(x_i) \bullet F'(x_k)}{norm(x_i) \cdot norm(x_k)}, \tag{7}
\]

where $F'(x_i) \bullet F'(x_k)$ is the inner product between $F'(x_i)$ and $F'(x_k)$,
defined as:

\[
F'(x_i) \bullet F'(x_k) = \sum_{j=1}^{N} f'_j(x_i) \cdot f'_j(x_k). \tag{8}
\]

Here $F'(x_i)$ is the normalized objective vector of $x_i$ and $f'_j(x_i)$ is its j-th
component. Note that $angle_{x_i \to x_k} \in [0, \pi/2]$.

The vector angle from an individual $x_i \in \Omega$ to the population is defined as
the minimum vector angle between $x_i$ and any other individual in the population
P: $\theta(x_i) = angle_{x_i \to P}$.
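
The vector angle between individuals and the angle from an individual to the population can be computed as in the sketch below. The function names are hypothetical, and the five points are invented to resemble the situation described for Fig. 2 (A and E playing the role of DRSs); they are not data from the paper. The penalty terms c and p introduced afterwards are not included.

```python
import numpy as np


def vector_angles(norm_objs: np.ndarray) -> np.ndarray:
    """Pairwise angles (radians) between normalized objective vectors."""
    norms = np.linalg.norm(norm_objs, axis=1)
    norms = np.where(norms > 0.0, norms, 1e-12)   # avoid division by zero
    cosine = (norm_objs @ norm_objs.T) / np.outer(norms, norms)
    return np.arccos(np.clip(cosine, -1.0, 1.0))


def angle_to_population(norm_objs: np.ndarray) -> np.ndarray:
    """theta(x_i): minimum angle from each individual to any other one."""
    angles = vector_angles(norm_objs)
    np.fill_diagonal(angles, np.inf)              # ignore the angle to itself
    return angles.min(axis=1)


# Hypothetical normalized objective vectors for individuals A, B, C, D, E.
pop = np.array([[0.0, 1.0], [0.02, 0.5], [0.3, 0.3], [0.5, 0.02], [1.0, 0.0]])
theta = angle_to_population(pop)
print(np.degrees(theta))
```

In this example the boundary individuals A and E are only a few degrees away from their neighbors B and D, while C is more than forty degrees away from every other individual, so an angle-based measure no longer rates A and E as well spread.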

When an individual x is selected into the archive in the environmental selection,
its $\theta(x)$ value will be punished. Several factors need to be
considered in order to achieve a good balance between proximity and diversity.


- A severe penalty should be imposed on individuals that have more adjacent
individuals in a niche. Inspired by the punishment method of crowding degree
estimation, the punishment of an individual x is based on the number of
individuals that have a lower proximity degree compared to x (denoted
as c). The punishment is aggravated with an increase of c.

- In order to avoid the situation that some individuals have the same vector
angle value to the population, individuals should be further punished. Therefore,
the penalty is implemented according to the proportion value of $\theta(x)$ to
all the individuals in the niche, denoted as p.


Keeping the above factors in mind, in aBiGE, the diversity estimation of an
individual $x \in \Omega$ based on vector angles is defined as
c+1 
fala) = T` (9) 
TE 


By applying the angle-based crowding degree estimation method to the BiGE
framework in minimizing many-objective optimization problems, we aim to
enhance the selection pressure on those non-dominated solutions in the
population of each generation and avoid the negative influence of DRSs in the
optimization process. Note that a smaller value of $f_a(x)$ is preferred.


4 Experiments 


4.1 Experimental Design 


To test the performance of the proposed aBiGE on those MaOPs where the
search process involves DRSs, the experiments are conducted on nine DTLZ
test problems: each of DTLZ1, DTLZ3, and DTLZ7 is considered with five,
eight, and ten objectives.

To make a fair comparison with the state-of-the-art BiGE for MaOPs, we
kept the same settings as [8]. Settings for both BiGE and aBiGE are:


- The population size of both algorithms is set to 100 for all test problems.

- Each algorithm is run 30 times per test problem to decrease the impact of its
stochastic nature.

- The termination criterion of a run is a predefined maximum of 30,000
evaluations, namely 300 generations for all test problems.

- For crossover and mutation operators, the crossover and mutation probabilities
are set to 1.0 and 1/M (where M represents the number of decision variables),
respectively. In particular, uniform crossover and polynomial mutation are
used.


Algorithm performance is assessed by performance indicators that consider
both proximity and diversity. In this paper, a modified version of the original
inverted generational distance indicator (IGD) [15], called IGD+ [7], is chosen
as the performance indicator. Although IGD has been widely used to evaluate
the performance of MOEAs on MaOPs, it has been shown [10] that IGD needs
to be replaced by IGD+ to make it compatible with Pareto dominance. IGD+
evaluates a solution set in terms of both convergence and diversity, and a smaller
value indicates better quality.
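
As a reminder of how IGD+ is computed for minimization problems, a minimal sketch follows. It uses the usual definition of the indicator, in which only the objectives where a solution is worse than a reference point contribute to the distance; the function name and the sample sets are hypothetical, and the reference set is assumed to be a sample of the true Pareto front.

```python
import numpy as np


def igd_plus(solutions: np.ndarray, reference: np.ndarray) -> float:
    """IGD+ for minimization: mean modified distance from reference points.

    solutions: array of shape (number of solutions, N).
    reference: array of shape (number of reference points, N).
    """
    # diff[i, j, :] = solution j minus reference point i, clipped at zero so
    # that only objectives where the solution is worse contribute.
    diff = np.maximum(solutions[None, :, :] - reference[:, None, :], 0.0)
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Closest solution for each reference point, averaged over the reference set.
    return float(dist.min(axis=1).mean())


# Hypothetical reference set and approximation set with two objectives.
reference = np.array([[0.0, 1.0], [1.0, 0.0]])
approx = np.array([[0.5, 1.5], [1.5, 0.5]])
print(igd_plus(approx, reference))
```

A solution that is at least as good as a reference point in every objective contributes a distance of zero to that point, which is the property that makes IGD+ compatible with Pareto dominance.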


4.2 Performance Comparison 


Test Problems with DRSs. Table 1 shows the mean and standard deviation 
of IGD+ metric results on nine DTLZ test problems with DRSs. For each test 
problem, among different algorithms, the algorithm that has the best result 
based on the IGD+ metric is shown in bold. As can be seen from the table, 
for MaOPs with DRSs, the proposed aBiGE performs significantly better than 
BiGE on all test problems in terms of convergence and diversity. 


Table 1. Mean and standard deviation of IGD+ metric on nine DTLZ test problems.
The best result for each test problem is highlighted in boldface.


Problem | Obj. | BiGE                  | aBiGE
DTLZ1   | 5    | 8.4207E-01 (3.59E-01) | 1.1768E-01 (3.41E-02)
        | 8    | 1.9350E+00 (1.27E+00) | 1.9495E-01 (9.44E-02)
        | 10   | 1.9653E+00 (1.36E+00) | 2.2763E-01 (9.57E-02)
DTLZ3   | 5    | 1.5705E+01 (5.87E+00) | 6.0008E+00 (3.50E+00)
        | 8    | 3.3434E+01 (1.17E+01) | 9.6401E+00 (6.30E+00)
        | 10   | 3.5720E+01 (1.58E+01) | 1.2780E+01 (5.40E+00)
DTLZ7   | 5    | 4.6666E-01 (1.52E-01) | 3.1701E-01 (6.48E-02)
        | 8    | 3.0415E+00 (6.03E-01) | 2.6350E+00 (8.59E-01)
        | 10   | 5.6152E+00 (7.41E-01) | 4.0059E+00 (4.53E-01)


To visualize the experimental results, Figs. 3 and 4 plot, by parallel coordinates,
the final solutions of one run on the five-objective DTLZ1 and the five-objective
DTLZ7, respectively. The run shown is the one whose result is closest to the mean
IGD+ value. As shown in Fig. 3(a), the approximation set obtained by BiGE has
inferior convergence on the five-objective DTLZ1, with its solution set ranging
between 0 and about 400, in contrast to the Pareto front ranging from 0 to 0.5 on
each objective. From Fig. 3(b), it can be observed that the solution set obtained
by the proposed aBiGE converges to the Pareto front and only a few individuals do
not converge.


Fig. 3. The final solution sets of the two algorithms on the five-objective DTLZ1,
shown by parallel coordinates (objective value against objective No.): (a) BiGE,
(b) aBiGE.


Fig. 4. The final solution sets of the two algorithms on the five-objective DTLZ7,
shown by parallel coordinates (objective value against objective No.): (a) BiGE,
(b) aBiGE.


For the solutions of the five-objective DTLZ7, the boundary of the first four
objectives is in the range [0, 1], and the boundary of the last objective is in the
range [3.49, 10] according to the formula of DTLZ7. As can be seen from Fig. 4,
all solutions of the proposed aBiGE appear to converge to the Pareto front.
In contrast, some solutions of BiGE (with objective values beyond the upper
boundary in the 5th objective) fail to reach the Pareto front. In addition, the
solution set of the proposed aBiGE has better extensity than that of BiGE on the
boundaries. In particular, the solution set of BiGE fails to cover the region from
3.49 to 6 of the last objective, whereas the solution set of the proposed aBiGE
only fails to cover the range of the Pareto front below 4 on the 5th objective.


Test Problem Without DRSs. Figure 5 gives the final solution sets of both
algorithms on the ten-objective DTLZ2 in order to visualize their distribution on
MaOPs without DRSs. As can be seen, the final solution sets of both algorithms
cover the Pareto front, with lower and upper boundaries within [0, 1] on each
objective. Moreover, following [9], the parallel coordinates in Fig. 5 partly
indicate that the diversity of solutions obtained by aBiGE is slightly worse than
that of BiGE. This observation is supported by the IGD+ performance indicator,
where BiGE obtained a slightly lower (better) value than the proposed aBiGE.


Fig. 5. The final solution sets of BiGE and aBiGE on the ten-objective DTLZ2,
evaluated by the IGD+ indicator and shown by parallel coordinates (objective value
against objective No.). (a) BiGE (IGD+ = 2.4319E-01) (b) aBiGE (IGD+ =
2.5021E-01).


5 Conclusion 


In this paper, we have addressed an issue of a well-established evolutionary
many-objective optimization algorithm, BiGE, on problems with a high probability
of producing dominance resistant solutions during the search process. We have
proposed an angle-based crowding degree estimation method to replace the
distance-based estimation in BiGE, thus significantly reducing the effect of
dominance resistant solutions on the algorithm. The effectiveness of the proposed
method has been evaluated on three representative problems with dominance
resistant solutions. It is worth mentioning that for problems without dominance
resistant solutions the proposed method performs slightly worse than the original
BiGE. In the near future, we would like to focus on the problems without dominance
resistant solutions, aiming at a comprehensive improvement of the algorithm on
both types of problems.